Conference: ACM SIGKDD international conference on Knowledge discovery and data mining

A robust and scalable clustering algorithm for mixed type attributes in large database environment


Abstract

Clustering is a widely used technique in data mining applications for discovering patterns in the underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either continuous or categorical attributes. However, datasets with mixed types of attributes are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering of data with both continuous and categorical attributes. This distance measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculation of this measure is memory efficient, as it depends only on the pair of clusters being merged and not on any of the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm that uses our distance measure within the BIRCH framework. Like BIRCH, our algorithm first performs a pre-clustering step by scanning the entire dataset and storing the dense regions of data records as summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from its ability to handle mixed-type attributes, our algorithm differs from BIRCH in that we add a procedure that enables the algorithm to determine the appropriate number of clusters automatically, and a new strategy for assigning cluster membership to noisy data. For data with mixed-type attributes, our experimental results confirm that the algorithm not only generates better-quality clusters than the traditional k-means algorithms, but also exhibits good scalability and correctly identifies the underlying number of clusters in the data. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard for data mining model deployment.
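
The merge-based distance described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the probabilistic model, so the code below assumes independent Gaussian continuous attributes and independent multinomial categorical attributes; the names `ClusterSummary`, `summarize`, `merge`, `log_likelihood`, `merge_distance`, and the `var_floor` regularizer are hypothetical and are not taken from the paper. It shows how the distance can be computed from the summary statistics of the two clusters alone, which is the memory-efficiency property the abstract mentions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], Sequence[str]]  # (continuous values, categorical labels)

@dataclass
class ClusterSummary:
    """Summary statistics for one cluster (hypothetical structure)."""
    n: int                        # number of records
    sums: List[float]             # per continuous attribute: sum of values
    sq_sums: List[float]          # per continuous attribute: sum of squares
    counts: List[Dict[str, int]]  # per categorical attribute: label -> count

def summarize(rows: Sequence[Row]) -> ClusterSummary:
    """Accumulate the summary statistics of a non-empty set of records."""
    first_cont, first_cat = rows[0]
    s = ClusterSummary(0, [0.0] * len(first_cont), [0.0] * len(first_cont),
                       [{} for _ in first_cat])
    for cont, cat in rows:
        s.n += 1
        for k, x in enumerate(cont):
            s.sums[k] += x
            s.sq_sums[k] += x * x
        for k, label in enumerate(cat):
            s.counts[k][label] = s.counts[k].get(label, 0) + 1
    return s

def merge(a: ClusterSummary, b: ClusterSummary) -> ClusterSummary:
    """Summaries are additive, so merging never revisits the raw records."""
    counts = []
    for ta, tb in zip(a.counts, b.counts):
        merged = dict(ta)
        for label, c in tb.items():
            merged[label] = merged.get(label, 0) + c
        counts.append(merged)
    return ClusterSummary(a.n + b.n,
                          [x + y for x, y in zip(a.sums, b.sums)],
                          [x + y for x, y in zip(a.sq_sums, b.sq_sums)],
                          counts)

def log_likelihood(c: ClusterSummary, var_floor: Sequence[float]) -> float:
    """Maximized log-likelihood of a cluster under the assumed model.

    Terms proportional to c.n are dropped because they cancel in the
    distance. var_floor is an assumed per-attribute constant added to the
    variance so that a zero-variance cluster does not produce log(0).
    """
    total = 0.0
    for s, sq, floor in zip(c.sums, c.sq_sums, var_floor):
        mean = s / c.n
        var = max(sq / c.n - mean * mean, 0.0)
        total -= 0.5 * c.n * math.log(var + floor)
    for table in c.counts:
        for cnt in table.values():
            total += cnt * math.log(cnt / c.n)
    return total

def merge_distance(a: ClusterSummary, b: ClusterSummary,
                   var_floor: Sequence[float]) -> float:
    """Decrease in log-likelihood caused by merging clusters a and b."""
    return (log_likelihood(a, var_floor) + log_likelihood(b, var_floor)
            - log_likelihood(merge(a, b), var_floor))

# Toy data: one continuous and one categorical attribute per record.
a = summarize([([1.0], ["x"]), ([2.0], ["x"])])
near = summarize([([1.0], ["x"]), ([2.0], ["x"])])
far = summarize([([10.0], ["y"]), ([11.0], ["y"])])
print(merge_distance(a, near, [1.0]))  # identical clusters: close to zero
print(merge_distance(a, far, [1.0]))   # dissimilar clusters: clearly positive
```

Because the summaries are additive, a pre-clustering pass could keep only one `ClusterSummary` per dense region and hand those to a hierarchical step that repeatedly merges the pair with the smallest `merge_distance`. That pipeline is a reading of the abstract, not a reproduction of the authors' implementation.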
