Relationship-based clustering and cluster ensembles for high-dimensional data mining.

Abstract

This dissertation takes a relationship-based approach to cluster analysis of high (1000 and more) dimensional data that side-steps the ‘curse of dimensionality’ issue by working in a suitable similarity space instead of the original feature space. We propose two frameworks that leverage graph algorithms to achieve relationship-based clustering and visualization, respectively. In the visualization framework, the output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in 2 dimensions, with clusters showing up as bands. Results on retail transaction, document (bag-of-words), and web-log data show that our approach can yield superior results while also taking additional balance constraints into account.

The choice of similarity is a critical step in relationship-based clustering and this motivates our systematic comparative study of the impact of similarity measures on the quality of document clusters. The key findings of our experimental study are: (i) Cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior to k-means and SOMs especially when balanced clusters are desired; and (iv) performance curves generally do not cross. We also propose a cluster quality evaluation measure based on normalized mutual information and find an analytical relation between similarity measures.

It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well known approaches to combining multiple clusterings. The idea of combining cluster labelings without accessing the original features leads to a general knowledge reuse framework that we call cluster ensembles. We propose a formal definition of the cluster ensemble as an optimization problem. Taking a relationship-based approach we propose three effective and efficient combining algorithms for solving it heuristically based on a hypergraph model. Results on synthetic as well as real data-sets show that cluster ensembles can (i) improve quality and robustness, and (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.
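For readers unfamiliar with the measures named in the abstract, the following is a minimal Python sketch, not code from the dissertation. It assumes the commonly used textbook definitions: cosine similarity, Pearson correlation computed as the cosine of mean-centered vectors, and extended Jaccard as x·y / (‖x‖² + ‖y‖² − x·y). The abstract says only that the quality measure is "based on normalized mutual information"; normalizing by the geometric mean of the two entropies, and returning 0 when either labeling has a single cluster, are assumptions made here for illustration. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine of the angle between vectors x and y (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def correlation(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def extended_jaccard(x, y):
    """Extended Jaccard (assumed definition): x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information of two labelings, normalized by sqrt(H(a) * H(b)).

    The geometric-mean normalization is an assumption of this sketch.
    Returns 0.0 if either labeling puts every point in one cluster.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in count_ab.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0
    return mutual_info / math.sqrt(h_a * h_b)


if __name__ == "__main__":
    doc1, doc2 = [1, 2, 0, 3], [2, 4, 0, 6]  # toy bag-of-words count vectors
    print(cosine(doc1, doc2), correlation(doc1, doc2), extended_jaccard(doc1, doc2))
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels renamed
```

Under these definitions cosine and correlation ignore vector length, while extended Jaccard does not: a vector and a scaled copy of itself have cosine 1 but extended Jaccard below 1. The NMI score depends only on how points are grouped, not on the label names, so it can compare a clustering against reference categories or against another clustering.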

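Bibliographic record

  • Author

    Strehl, Alexander.

  • Author affiliation

    The University of Texas at Austin.

  • Degree-granting institution The University of Texas at Austin.
  • Discipline Computer Science.
  • Degree Ph.D.
  • Year 2002
  • Pages 215 p.
  • Total pages 215
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification Automation technology, computer technology
  • Keywords

  • Date added to database 2022-08-17 11:46:06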
