Journal: Knowledge and Information Systems

Effective data summarization for hierarchical clustering in large datasets

Abstract

Cluster analysis of large datasets is a challenge in many fields of science and engineering. One important approach is hierarchical clustering, which outputs a hierarchical (nested) structure of a given dataset. Single-link is a distance-based hierarchical clustering method that can find non-convex (arbitrarily shaped) clusters in a dataset. However, it cannot be applied to large datasets, because it either keeps the entire dataset in main memory or scans the dataset multiple times from secondary memory. Both are potentially severe problems for cluster analysis of large datasets. One remedy for both problems is to create a summary of the dataset efficiently and then use that summary to speed up clustering. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up the single-link clustering method on large datasets. The ds scheme uses the sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds, and the modified method is termed summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and its clustering results are close to those produced by single-link. The SSL method also outperforms single-link applied to data bubbles (a summarization scheme) in both clustering accuracy and computation time. To speed up the proposed summarization scheme, a technique is introduced that avoids a large number of distance computations in the leaders method. Experimental studies demonstrate the effectiveness of the proposed summarization scheme for large datasets.
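
The abstract does not say which statistics a data sphere stores or how SSL measures distance between summaries, so the following Python sketch only illustrates the general pipeline it describes: one sequential pass of leaders clustering to summarize the data, followed by single-link clustering on the summary instead of on every point. The function names (`leaders`, `single_link_cut`), the threshold parameters `tau` and `cut`, and the choice of a per-leader point count as the only stored statistic are assumptions made for illustration, not the paper's definitions.

```python
import math

def leaders(points, tau):
    """Sequential leaders clustering: a single scan over the data.

    Each point joins the first existing leader within distance tau;
    otherwise it becomes a new leader. The per-leader count is an
    assumed stand-in for the statistics a data sphere would keep.
    """
    reps = []    # leader (representative) points
    counts = []  # number of points assigned to each leader
    for p in points:
        for i, rep in enumerate(reps):
            if math.dist(p, rep) <= tau:
                counts[i] += 1
                break
        else:
            reps.append(p)
            counts.append(1)
    return reps, counts

def single_link_cut(items, cut):
    """Single-link clusters at distance threshold `cut`.

    Cutting a single-link dendrogram at height `cut` yields the
    connected components of the graph that links every pair of
    items at distance <= cut; union-find computes those components.
    """
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if math.dist(items[i], items[j]) <= cut:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
    reps, counts = leaders(data, tau=0.5)
    labels = single_link_cut(reps, cut=1.0)
    print(reps, counts, labels)
```

The quadratic pairwise loop now runs over the leaders rather than over all points, which is where a summary-based method would save time. This sketch uses plain distances between leader points; a faithful SSL implementation would use whatever summary-to-summary distance the paper defines.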
