Conference: IEEE International Conference on Data Mining Workshops

A Multi Density-Based Clustering Algorithm for Data Stream with Noise



Abstract

Density-based clustering algorithms can detect arbitrarily shaped clusters, handle outliers, and do not need the number of clusters in advance. However, they do not work properly in multi-density environments. Existing multi-density clustering algorithms have problems that limit their applicability to data streams, such as needing the whole data set to perform clustering, requiring two passes over the data, and having high execution time. Data streams arrive continuously and have to be processed in limited time and memory. Therefore, we need an algorithm that clusters data streams with different densities and also overcomes the general challenges of clustering data streams. In this paper, we introduce a multi-density clustering algorithm for data streams called MuDi-Stream. MuDi-Stream is an online-offline clustering algorithm in which the online phase forms core-mini-clusters using a newly proposed core distance, and the offline phase clusters the core-mini-clusters using a density-based method. The new core distance, called the mini core distance, is calculated based on the number of neighboring data points around the core. The algorithm therefore has different core distances for different clusters, which allows it to cover multi-density environments. A novel pruning strategy is also used to separate the real data from the noise by mapping the outliers onto a grid. The grid structure keeps the neighbors of each data point, which is used to determine the mini core distance and to remove noise effectively. Our performance study on synthetic data sets demonstrates the effectiveness of our method.
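To make the online phase described in the abstract more concrete, here is a minimal Python sketch of the general idea. It is not the paper's algorithm: the abstract does not give the formula for the mini core distance, so the per-cluster `radius` below (a scaled standard deviation of the cluster's own points) is an assumed stand-in, and `MiniCluster`, `cluster_stream`, `cell_size`, `min_pts`, and `scale` are hypothetical names and parameters chosen for illustration. The offline phase and any pruning of old grid cells are omitted.

```python
import math
from collections import defaultdict


class MiniCluster:
    """Running summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, points):
        dim = len(points[0])
        self.n = 0
        self.ls = [0.0] * dim  # per-dimension sum of coordinates
        self.ss = [0.0] * dim  # per-dimension sum of squared coordinates
        for p in points:
            self.add(p)

    def add(self, p):
        self.n += 1
        for i, x in enumerate(p):
            self.ls[i] += x
            self.ss[i] += x * x

    def center(self):
        return [s / self.n for s in self.ls]

    def radius(self, scale=2.0, floor=1e-3):
        # ASSUMPTION: stand-in for the "mini core distance". The radius is
        # derived from the cluster's own points, so a dense cluster gets a
        # small radius and a sparse cluster gets a large one.
        var = sum(
            max(q / self.n - (s / self.n) ** 2, 0.0)
            for s, q in zip(self.ls, self.ss)
        )
        return max(scale * math.sqrt(var), floor)


def cluster_stream(points, cell_size=1.0, min_pts=3):
    """Single pass: absorb each point into a mini-cluster or park it in a grid cell."""
    clusters = []
    grid = defaultdict(list)  # grid cell -> points not yet in any cluster
    for p in points:
        best, best_d = None, math.inf
        for c in clusters:
            d = math.dist(p, c.center())
            if d <= c.radius() and d < best_d:
                best, best_d = c, d
        if best is not None:
            best.add(p)
            continue
        # No cluster covers the point: treat it as a potential outlier.
        key = tuple(math.floor(x / cell_size) for x in p)
        grid[key].append(p)
        if len(grid[key]) >= min_pts:
            # The cell became dense enough to be promoted to a mini-cluster.
            clusters.append(MiniCluster(grid.pop(key)))
    return clusters, dict(grid)


# A tight group, a loose group, and one isolated point (made-up data).
stream = [
    (0.10, 0.10), (0.12, 0.11), (0.11, 0.13), (0.13, 0.12), (0.12, 0.12),
    (10.2, 10.2), (10.8, 10.4), (10.5, 10.9), (10.6, 10.5),
    (50.0, -20.0),
]
clusters, leftover = cluster_stream(stream)
for c in clusters:
    print(c.n, c.center(), c.radius())
print("still in grid:", leftover)
```

In this toy run the tight group and the loose group each end up with their own radius, while the isolated point stays in the grid rather than forming a cluster. That mirrors, in a simplified way, the abstract's idea of different core distances for different densities plus a grid that holds candidate outliers.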
