Journal: IEEE Transactions on Knowledge and Data Engineering

Clustering Data Streams Based on Shared Density between Micro-Clusters


Abstract

As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second, offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points, with the density estimates used as their weights. However, information about the density in the area between micro-clusters is not preserved in the online process, so reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on the actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
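
The two-step idea in the abstract (an online step that maintains micro-clusters plus a shared density graph, and an offline step that reclusters along that graph) can be illustrated with a small toy sketch. This is not the DBSTREAM algorithm from the paper: the abstract does not give its update rules, so the fixed `radius`, the stationary centers, the absence of any fading, the normalization of shared weight by the mean weight of the pair, and the threshold name `alpha` are all assumptions made purely for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


class SharedDensitySketch:
    """Toy illustration only (hypothetical class, not the paper's method)."""

    def __init__(self, radius):
        self.radius = radius            # assumed fixed micro-cluster radius
        self.centers = []               # micro-cluster centers (never moved here)
        self.weights = []               # points absorbed by each micro-cluster
        self.shared = defaultdict(int)  # (i, j) with i < j -> points inside both

    def insert(self, point):
        """Online step: update every micro-cluster that covers the point."""
        hits = [i for i, c in enumerate(self.centers)
                if math.dist(point, c) <= self.radius]
        if not hits:
            # No micro-cluster covers the point, so it starts a new one.
            self.centers.append(tuple(point))
            self.weights.append(1)
            return
        for i in hits:
            self.weights[i] += 1
        # A point covered by two micro-clusters adds to their shared density.
        for i, j in combinations(hits, 2):
            self.shared[(i, j)] += 1

    def recluster(self, alpha):
        """Offline step: merge micro-clusters linked by enough shared density."""
        parent = list(range(len(self.centers)))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for (i, j), s in self.shared.items():
            mean_weight = (self.weights[i] + self.weights[j]) / 2
            if s / mean_weight >= alpha:       # assumed connectivity rule
                parent[find(i)] = find(j)
        # One label per micro-cluster; equal labels mean the same final cluster.
        return [find(i) for i in range(len(self.centers))]


# Usage: two overlapping micro-clusters and one far away.
sketch = SharedDensitySketch(radius=1.0)
for p in [(0.0, 0.0), (1.5, 0.0), (0.75, 0.0), (10.0, 0.0)]:
    sketch.insert(p)
labels = sketch.recluster(alpha=0.3)
```

In this sketch the point `(0.75, 0.0)` falls inside both of the first two micro-clusters, so it is the only evidence linking them. Reclustering then follows that recorded overlap instead of assuming a uniform or Gaussian shape around each center.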
