Data bubbles

Abstract

In this paper, we investigate how to scale hierarchical clustering methods (such as OPTICS) to extremely large databases by utilizing data compression methods (such as BIRCH or random sampling). We propose a three-step procedure: 1) compress the data into suitable representative objects; 2) apply the hierarchical clustering algorithm only to these objects; 3) recover the clustering structure for the whole data set, based on the result for the compressed data. The key issue in this approach is to design compressed data items so that a hierarchical clustering algorithm can be applied to them and so that they contain enough information to infer the clustering structure of the original data set in the third step. This is crucial because the results of hierarchical clustering algorithms, when applied naively to a random sample or to the clustering features (CFs) generated by BIRCH, deteriorate rapidly at higher compression rates. We identify three key problems that cause this deterioration. To solve these problems, we propose an efficient post-processing step and the concept of a Data Bubble as a special kind of compressed data item. Applying OPTICS to these Data Bubbles allows us to recover a very accurate approximation of the clustering structure of a large data set even at very high compression rates. A comprehensive performance and quality evaluation shows that we trade only very little quality of the clustering result for a great increase in performance.
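The three-step procedure can be illustrated with a minimal Python sketch. It is not the method evaluated in the paper: random sampling stands in for the compression step, a plain single-linkage agglomeration stands in for OPTICS, and the per-representative summary (count `n`, `center`, `extent`) is a simplified, hypothetical stand-in for a Data Bubble. All function names and fields are assumptions made for illustration.

```python
import math
import random


def compress(points, k, seed=0):
    """Step 1 (assumed variant): sample k representatives, then summarize
    the points that fall nearest to each one."""
    rng = random.Random(seed)
    reps = rng.sample(points, k)
    members = [[] for _ in range(k)]
    for idx, p in enumerate(points):
        nearest = min(range(k), key=lambda j: math.dist(p, reps[j]))
        members[nearest].append(idx)
    summaries = []
    for idxs in members:
        if not idxs:  # can only happen when the data contains duplicates
            continue
        pts = [points[i] for i in idxs]
        center = tuple(sum(c) / len(pts) for c in zip(*pts))
        extent = max(math.dist(center, p) for p in pts)
        summaries.append(
            {"n": len(pts), "center": center, "extent": extent, "members": idxs}
        )
    return summaries


def cluster_representatives(centers, n_clusters):
    """Step 2 (stand-in for OPTICS): single-linkage agglomeration that
    merges the two closest groups until n_clusters groups remain."""
    clusters = [[i] for i in range(len(centers))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(centers[i], centers[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(centers)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label
    return labels


def expand(summaries, rep_labels, n_points):
    """Step 3: give every original point the label of its representative."""
    labels = [None] * n_points
    for summary, label in zip(summaries, rep_labels):
        for i in summary["members"]:
            labels[i] = label
    return labels


# Two well-separated 10 x 10 grids of points (illustrative data only).
blob_a = [(x * 0.1, y * 0.1) for x in range(10) for y in range(10)]
blob_b = [(x + 10.0, y + 10.0) for (x, y) in blob_a]
data = blob_a + blob_b

bubbles = compress(data, k=20)
rep_labels = cluster_representatives([b["center"] for b in bubbles], n_clusters=2)
labels = expand(bubbles, rep_labels, len(data))
print(len(bubbles), "summaries for", len(data), "points")
```

Only the summaries reach the quadratic clustering step, which is where the performance gain comes from. The sketch ignores `n` and `extent` during clustering; using such per-item information to correct distances between compressed items is the kind of refinement the abstract attributes to Data Bubbles.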
