
Large-scale clustering: Algorithms and applications.


Abstract

Clustering is a central problem in unsupervised learning for discovering interesting patterns in the underlying data. Though there have been numerous studies on clustering methods, the focus of this dissertation is on developing efficient clustering algorithms for large-scale applications such as text mining, network analysis, image segmentation and bioinformatics.

We first present a time- and memory-efficient technique for the entire process of text clustering, including the creation of the vector space model for documents. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection.

Clustering algorithms that are based on heuristics can get trapped in inferior local optima and therefore yield qualitatively poor results. As the second part of our work, we propose the use of local search and annealing to improve the quality of the clustering results. In local search, we create a chain of incremental point moves that leads the objective function out of local optima, while the idea of annealing is that we enforce the perturbation of cluster centers after clusters become stabilized. The effectiveness of these techniques is illustrated in text clustering and gene expression analysis.

Data in many domains, such as cluster analysis of the World Wide Web or circuit partitioning, is represented as graphs. Clustering is often used to find and analyze structural and functional properties of these graphs. In the last part of the dissertation, we present an efficient, high-quality multilevel kernel-based graph clustering algorithm, which outperforms previous state-of-the-art spectral methods in quality and runs hundreds or even thousands of times faster. Our multilevel graph clustering algorithm is based on a theoretical connection with the weighted kernel k-means clustering algorithm. We empirically demonstrate that our algorithm is efficient and effective on large social networks, protein interaction networks and image segmentation.
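
The abstract refers to the weighted kernel k-means algorithm without describing it. For orientation only, here is a minimal sketch of the standard batch form of that algorithm on a precomputed kernel matrix. It is not the dissertation's multilevel graph clustering method, and the function name `weighted_kernel_kmeans`, its parameters, and the empty-cluster handling are assumptions made for illustration.

```python
import numpy as np

def weighted_kernel_kmeans(K, w, labels, k, max_iter=100):
    """Batch weighted kernel k-means on a precomputed kernel matrix K.

    Illustrative sketch only (hypothetical helper, not from the dissertation).
    K: (n, n) kernel matrix, w: (n,) non-negative point weights,
    labels: initial cluster assignment in {0, ..., k-1}.
    """
    K = np.asarray(K, dtype=float)
    w = np.asarray(w, dtype=float)
    labels = np.asarray(labels).copy()
    n = K.shape[0]
    diag = np.diag(K)
    for _ in range(max_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            wc = w[members]
            s = wc.sum()
            if s == 0:
                continue  # assumption: an empty cluster stays at infinite distance
            # Cross term: sum_j w_j K_ij / s, for every point i
            cross = K[:, members] @ wc / s
            # Within-cluster term: sum_{j,l} w_j w_l K_jl / s^2
            within = wc @ K[np.ix_(members, members)] @ wc / s**2
            # Squared feature-space distance from each point to the weighted centroid
            dist[:, c] = diag - 2 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

With a linear kernel `K = X @ X.T` and unit weights, this sketch reduces to ordinary k-means on the rows of `X`. Like other batch heuristics, it can stop at a local optimum that depends on the initial labels.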

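Bibliographic record

  • Author

    Guan, Yuqiang.

  • Author affiliation

    The University of Texas at Austin.

  • Degree-granting institution: The University of Texas at Austin.
  • Discipline: Computer Science.
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 174 p.
  • Total pages: 174
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology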
