
Relationship-based clustering and cluster ensembles for high-dimensional data mining.

Abstract

This dissertation takes a relationship-based approach to cluster analysis of high-dimensional data (1,000 or more dimensions) that side-steps the ‘curse of dimensionality’ by working in a suitable similarity space instead of the original feature space. We propose two frameworks that leverage graph algorithms to achieve relationship-based clustering and visualization, respectively. In the visualization framework, the output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. Results on retail transaction, document (bag-of-words), and web-log data show that our approach can yield superior results while also taking additional balance constraints into account.

The choice of similarity is a critical step in relationship-based clustering, and this motivates our systematic comparative study of the impact of similarity measures on the quality of document clusters. The key findings of our experimental study are: (i) cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior to k-means and SOMs, especially when balanced clusters are desired; and (iv) performance curves generally do not cross. We also propose a cluster quality evaluation measure based on normalized mutual information and find an analytical relation between similarity measures.

It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well-known approaches to combining multiple clusterings. The idea of combining cluster labelings without accessing the original features leads to a general knowledge reuse framework that we call cluster ensembles. We propose a formal definition of the cluster ensemble as an optimization problem. Taking a relationship-based approach, we propose three effective and efficient combining algorithms for solving it heuristically based on a hypergraph model. Results on synthetic as well as real data sets show that cluster ensembles can (i) improve quality and robustness, (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.
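As an illustration of two ideas named in the abstract, the sketch below builds a cosine similarity matrix, permutes it so that points with the same cluster label sit next to each other (which is what makes clusters appear as bands or blocks), and scores the agreement of two labelings with normalized mutual information. It is a minimal sketch, not the dissertation's implementation: the function names and toy data are hypothetical, and normalizing mutual information by the geometric mean of the two entropies is an assumption, since the abstract does not state the exact normalization.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(points):
    """Pairwise cosine similarities: the 'similarity space' view of the data."""
    return [[cosine_similarity(p, q) for q in points] for p in points]


def reorder_by_cluster(sim, labels):
    """Permute rows and columns so same-label points become contiguous."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    permuted = [[sim[i][j] for j in order] for i in order]
    return order, permuted


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """Mutual information of two labelings divided by sqrt(H(a) * H(b)).

    The geometric-mean normalization is an assumption of this sketch.
    """
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a single-cluster labeling carries no information
    return mi / math.sqrt(h_a * h_b)


# Hypothetical toy data: two groups pointing in different directions.
points = [[1, 0, 0], [0, 1, 1], [2, 0, 0], [0, 3, 2]]
labels = [0, 1, 0, 1]

order, permuted = reorder_by_cluster(similarity_matrix(points), labels)
for row in permuted:
    print(" ".join(f"{value:.2f}" for value in row))
```

Because the measure depends only on how the two labelings partition the points, renaming the cluster ids does not change the score. That property is also what lets labelings be compared or combined without access to the original features.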


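Bibliographic record

Author: Strehl, Alexander
Author affiliation: The University of Texas at Austin
Degree-granting institution: The University of Texas at Austin
Discipline: Computer Science
Degree: Ph.D.
Year: 2002
Pages: 215
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology
Date added to database: 2022-08-17 11:46:06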

Similar literature


1. Relationship-Based Clustering and Visualization for High-Dimensional Data Mining [J]. Alexander Strehl, Joydeep Ghosh. ORSA Journal on Computing, 2003, No. 2.
2. Ensemble decision forest of RBF networks via hybrid feature clustering approach for high-dimensional data classification [J]. Abpeykar Shadi, Ghatee Mehdi, Zare Hadi. Computational Statistics & Data Analysis, 2019.
3. An Ensemble Clustering for Mining High-Dimensional Biological Big Data [J]. Dewan Md. Farid, Ann Nowe, Bernard Manderick. International Journal of Design & Nature and Ecodynamics, 2016, No. 3.
4. A feature grouping method for ensemble clustering of high-dimensional genomic big data [C]. Dewan Md. Farid, Ann Nowe, Bernard Manderick. 2016 Future Technologies Conference, 2016.
5. High-Dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products [D]. Zhou, Dunke. 2012.
6. Evaluation of an ensemble-based distance statistic for clustering MLST datasets using epidemiologically defined clusters of cyclosporiasis [O]. Fernanda S. Nascimento, Joel Barratt, Katelyn Houghton. 2020.
7. Relationship-Based Clustering and Visualization for High-Dimensional Data Mining [O]. Alexander Strehl, Joydeep Ghosh. 2002.
