Venue: International Workshop on Embedded Multicore Systems

EDDS: An Enhanced Density-based Method for Clustering Data Streams

Abstract

Data stream clustering is an active area of research in big data. It refers to clustering constantly arriving data records and updating existing cluster patterns and outliers in light of the newly arrived data. Density-based algorithms for this problem promise to find clusters of arbitrary shape and to detect anomalies without prior knowledge of the number of clusters. In this paper, a new incremental algorithm known as Enhanced Density-based Data Stream (EDDS) is developed to overcome limitations of the existing solutions. The algorithm detects clusters and outliers in an incoming data chunk, merges new clusters from the chunk with the existing clusters, and filters out new outliers for the next round. It modifies the traditional DBSCAN algorithm to summarise each cluster in terms of a set of surface-core points. The algorithm applies the density-reachable concept of DBSCAN as its merging strategy and prunes the internal core points using a heuristic solution. The algorithm also removes aged core points and outliers according to a fading function. The paper investigates three versions of the algorithm, one for each of three possible cluster representations: all core points are maintained (EDDS-I), only core points of the new clusters from the incoming chunk are kept (EDDS-II), or only the surface-core points of the cluster shapes are kept (EDDS-III). Comparing the three versions examines the balance between the efficiency gained by the algorithm and the overhead time spent pruning internal core points. The algorithm was evaluated on selected datasets using various quality measures. The experimental results indicate improved clustering correctness, with time complexity comparable to that of existing solutions to the same kind of problem.
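
The merging and ageing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the fading function or the pruning heuristic, so the exponential form `2^(-decay * age)`, the threshold `min_weight`, and every name below (`CorePoint`, `fading_weight`, `drop_aged`, `clusters_touch`, `merge_chunk`) are assumptions made for illustration. Pruning of internal core points is left out entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class CorePoint:
    coords: tuple      # position of the core point
    timestamp: float   # arrival time of the chunk that produced it


def fading_weight(age, decay=0.1):
    # ASSUMPTION: exponential fading, common in stream clustering;
    # the abstract does not state which fading function EDDS uses.
    return 2.0 ** (-decay * age)


def drop_aged(points, now, decay=0.1, min_weight=0.25):
    # Keep only the core points whose faded weight is still above the
    # (hypothetical) threshold min_weight.
    return [p for p in points
            if fading_weight(now - p.timestamp, decay) >= min_weight]


def clusters_touch(a, b, eps):
    # Two core points within eps of each other are directly
    # density-reachable in DBSCAN terms, so their clusters are connected.
    return any(math.dist(p.coords, q.coords) <= eps for p in a for q in b)


def merge_chunk(existing, new_clusters, eps):
    # Merge every new cluster with all existing clusters it touches.
    # Assumes the existing clusters are mutually separated, and that the
    # new clusters from one chunk are mutually separated as well.
    merged = [list(c) for c in existing]
    for new in new_clusters:
        combined = list(new)
        untouched = []
        for old in merged:
            if clusters_touch(old, new, eps):
                combined.extend(old)
            else:
                untouched.append(old)
        merged = untouched + [combined]
    return merged


existing = [
    [CorePoint((0.0, 0.0), 0.0), CorePoint((0.5, 0.0), 0.0)],
    [CorePoint((10.0, 10.0), 0.0)],
]
chunk = [
    [CorePoint((1.0, 0.0), 1.0)],    # within eps of the first cluster
    [CorePoint((20.0, 20.0), 1.0)],  # far from everything: a new cluster
]
result = merge_chunk(existing, chunk, eps=0.6)
sizes = sorted(len(c) for c in result)
```

In this toy run the first new cluster joins the existing cluster near the origin, while the second stays separate. Comparing every pair of core points is quadratic in cluster size, which suggests why a representation restricted to surface-core points could reduce the cost of merging.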
