Conference: International conference on very large databases

Data Bubbles for Non-Vector Data: Speeding-up Hierarchical Clustering in Arbitrary Metric Spaces

Abstract

To speed up clustering algorithms, data summarization methods have been proposed that first summarize the data set by computing suitable representative objects. A clustering algorithm is then applied to these representatives only, and a clustering structure for the whole data set is derived from the result for the representatives. Most previous methods are, however, limited in their application domain. They are in general based on sufficient statistics such as the linear sum of a set of points, which assumes that the data come from a vector space. In many important applications, on the other hand, the data come from a metric non-vector space, and only distances between objects can be exploited to construct effective data summarizations. In this paper, we develop a new data summarization method, based only on distance information, that can be applied directly to non-vector data. An extensive performance evaluation shows that our method is very effective in finding the hierarchical clustering structure of non-vector data using only a very small number of data summarizations, resulting in a large reduction of runtime while sacrificing very little clustering quality.
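The summarize-then-cluster pipeline described in the abstract can be illustrated with a minimal sketch. This is not the paper's Data Bubbles method. It assumes randomly sampled representatives, nearest-representative assignment, and a naive single-link clustering, and it uses strings under edit distance as an example of a metric non-vector space. All function names (`edit_distance`, `summarize`, `single_link`, `cluster_via_summary`) are hypothetical and chosen for illustration only.

```python
import random
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings, which are not vectors."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def summarize(objects, k, dist, seed=0):
    """Assumption: pick k random representatives (by index) and assign
    every object to its nearest representative, using distances only."""
    rng = random.Random(seed)
    reps = rng.sample(range(len(objects)), k)
    assignment = [min(reps, key=lambda r: dist(objects[i], objects[r]))
                  for i in range(len(objects))]
    return reps, assignment


def single_link(items, dist, n_clusters):
    """Naive agglomerative single-link clustering, cut at n_clusters."""
    clusters = [[x] for x in items]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest minimum pairwise distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: min(dist(u, v)
                                     for u in clusters[p[0]]
                                     for v in clusters[p[1]]))
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
    return clusters


def cluster_via_summary(objects, k, n_clusters, dist, seed=0):
    """Cluster only the k representatives, then give each object the
    cluster label of the representative it was assigned to."""
    reps, assignment = summarize(objects, k, dist, seed)
    rep_clusters = single_link(
        reps, lambda r, s: dist(objects[r], objects[s]), n_clusters)
    rep_label = {r: c for c, members in enumerate(rep_clusters) for r in members}
    return [rep_label[r] for r in assignment]


words = ["aaaa", "aaab", "aaba", "zzzz", "zzzy", "zyzz"]
labels = cluster_via_summary(words, k=4, n_clusters=2, dist=edit_distance)
```

The expensive clustering step here only ever compares representatives with each other, which is where the runtime saving of summarization comes from. Each object needs just `k` distance computations for its assignment.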
