Information Sciences: An International Journal

A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data



Abstract

Most data streams encountered in real life consist of data objects with mixed numerical and categorical attributes. Most current data stream algorithms suffer from low clustering quality, difficulty in determining cluster centers, and poor handling of outliers. A fast density-based data stream clustering algorithm is proposed in which cluster centers are determined automatically in the initialization stage. Based on an analysis of the relationships among data attributes, mixed data sets are classified into three types, and a corresponding distance measure is designed for each type. Based on the field intensity-distance distribution graph for each data object, a linear regression model and residual analysis are used to find the outliers of the graph, which enables cluster centers to be determined automatically. Once the cluster centers are found, all data objects can be clustered according to their distances to the centers. The data stream clustering algorithm adopts an online/offline two-stage processing framework and a new micro-cluster characteristic vector to maintain arriving data objects dynamically. A decay function and a deletion mechanism are used to maintain the micro-clusters, so that they accurately reflect the evolution of the data stream. Finally, the performance of the proposed algorithm is evaluated in a series of experiments on real-world mixed data sets, in comparison with several outstanding clustering algorithms, in terms of clustering purity, efficiency, and time complexity. (C) 2016 Elsevier Inc. All rights reserved.
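The abstract mentions a micro-cluster characteristic vector maintained with a decay function and a deletion mechanism, but it does not give their form. The Python sketch below illustrates only the general idea and rests on several assumptions: the exponential fading factor 2^(-lam * elapsed) is borrowed from common density-based stream clustering practice rather than from this paper, and the `MicroCluster` fields, the `prune` helper, and the `lam` and `min_weight` parameters are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Hypothetical summary of a micro-cluster over mixed data (not the paper's vector)."""
    numeric_sum: list          # decayed sum of each numeric attribute
    category_counts: list      # one Counter of decayed value counts per categorical attribute
    weight: float = 0.0        # decayed number of absorbed objects
    last_update: float = 0.0   # time at which the statistics were last decayed

    def decay(self, now, lam):
        """Fade all statistics by the assumed factor 2 ** (-lam * elapsed)."""
        factor = 2.0 ** (-lam * (now - self.last_update))
        self.weight *= factor
        self.numeric_sum = [s * factor for s in self.numeric_sum]
        for counts in self.category_counts:
            for value in counts:
                counts[value] *= factor
        self.last_update = now

    def absorb(self, numeric, categorical, now, lam):
        """Decay to the current time, then add one arriving object."""
        self.decay(now, lam)
        self.weight += 1.0
        self.numeric_sum = [s + x for s, x in zip(self.numeric_sum, numeric)]
        for counts, value in zip(self.category_counts, categorical):
            counts[value] += 1.0

    def center(self):
        """Weighted mean of numeric attributes and mode of each categorical one."""
        mean = [s / self.weight for s in self.numeric_sum]
        modes = [counts.most_common(1)[0][0] for counts in self.category_counts]
        return mean, modes


def prune(clusters, now, lam, min_weight):
    """Decay every micro-cluster to `now` and drop those below `min_weight`."""
    for mc in clusters:
        mc.decay(now, lam)
    return [mc for mc in clusters if mc.weight >= min_weight]


# Usage: two objects arrive one time unit apart, then the cluster goes stale.
mc = MicroCluster(numeric_sum=[0.0, 0.0], category_counts=[Counter()])
mc.absorb([2.0, 4.0], ["red"], now=0.0, lam=1.0)
mc.absorb([4.0, 8.0], ["blue"], now=1.0, lam=1.0)
survivors = prune([mc], now=3.0, lam=1.0, min_weight=0.5)
```

Under these assumptions, older objects contribute less to the center than recent ones, and a micro-cluster that stops receiving data eventually falls below the weight threshold and is deleted. This is one simple way for a summary structure to track an evolving stream.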
