ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

A robust and scalable clustering algorithm for mixed type attributes in large database environment


Abstract

Clustering is widely used in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to datasets that contain either continuous or categorical attributes, but not both. However, datasets with mixed attribute types are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering of data with both continuous and categorical attributes. The measure is derived from a probabilistic model in which the distance between two clusters equals the decrease in the log-likelihood function that results from merging them. Computing the measure is memory efficient because it depends only on the pair of clusters being merged and not on any other cluster. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm that uses our distance measure within the BIRCH framework. Like BIRCH, our algorithm first performs a pre-clustering step: it scans the entire dataset and stores the dense regions of data records as summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Besides handling mixed-type attributes, our algorithm differs from BIRCH in two ways: it adds a procedure that automatically determines the appropriate number of clusters, and it uses a new strategy for assigning cluster membership to noisy data. For data with mixed-type attributes, our experimental results confirm that the algorithm not only generates better-quality clusters than the traditional k-means algorithm, but also scales well and correctly identifies the underlying number of clusters in the data. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard for data mining model deployment.
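The abstract states the distance measure only in words, so the following is a minimal sketch of one way it could be realized, not the paper's stated formula. It assumes independent attributes, Gaussian continuous attributes, and multinomial categorical attributes. The names `ClusterSummary`, `summarize`, `merge`, `xi`, `distance`, and the `var_floor` parameter are all hypothetical. Each cluster is held only as additive summary statistics, which is why the distance needs nothing beyond the two clusters being merged.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    """Additive summary statistics of one cluster (illustrative layout)."""
    n: int          # number of records
    sums: list      # per continuous attribute: sum of values
    sq_sums: list   # per continuous attribute: sum of squared values
    counts: list    # per categorical attribute: Counter of category counts


def summarize(records):
    """Build a summary from (continuous_values, categorical_values) tuples."""
    n_cont, n_cat = len(records[0][0]), len(records[0][1])
    return ClusterSummary(
        n=len(records),
        sums=[sum(r[0][k] for r in records) for k in range(n_cont)],
        sq_sums=[sum(r[0][k] ** 2 for r in records) for k in range(n_cont)],
        counts=[Counter(r[1][k] for r in records) for k in range(n_cat)],
    )


def merge(a, b):
    """Summary of the merged cluster, computed from the two summaries alone."""
    return ClusterSummary(
        n=a.n + b.n,
        sums=[x + y for x, y in zip(a.sums, b.sums)],
        sq_sums=[x + y for x, y in zip(a.sq_sums, b.sq_sums)],
        counts=[x + y for x, y in zip(a.counts, b.counts)],
    )


def xi(c, var_floor=0.01):
    """Maximized log-likelihood of a cluster, up to constants that cancel
    in `distance`. `var_floor` (an assumption) keeps the log finite for
    clusters with zero variance."""
    total = 0.0
    for s, sq in zip(c.sums, c.sq_sums):
        mean = s / c.n
        var = max(sq / c.n - mean * mean, 0.0)
        total += 0.5 * math.log(var + var_floor)   # Gaussian term
    for counter in c.counts:
        # Entropy of the category distribution (multinomial term).
        total += -sum((m / c.n) * math.log(m / c.n) for m in counter.values())
    return -c.n * total


def distance(a, b, var_floor=0.01):
    """Decrease in log-likelihood caused by merging clusters a and b."""
    return xi(a, var_floor) + xi(b, var_floor) - xi(merge(a, b), var_floor)


# Toy data: one continuous and one categorical attribute per record.
a = summarize([((1.0,), ("x",)), ((1.2,), ("x",)), ((0.8,), ("y",))])
b = summarize([((1.1,), ("x",)), ((0.9,), ("x",))])
c = summarize([((9.0,), ("z",)), ((9.5,), ("z",))])

print(distance(a, b) < distance(a, c))  # similar clusters are closer
```

In this sketch, merging similar clusters costs little log-likelihood and so yields a small distance, while merging dissimilar clusters yields a large one. A hierarchical step could repeatedly merge the closest pair using `merge` without revisiting the raw records.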
