A robust and scalable clustering algorithm for mixed type attributes in large database environment


Abstract

Clustering is a widely used technique in data mining applications for discovering patterns in the underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either continuous or categorical attributes. However, datasets with mixed types of attributes are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering of data with both continuous and categorical attributes. This distance measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculation of this measure is memory efficient, as it depends only on the pair of clusters being merged and not on any of the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm using our distance measure based on the framework of BIRCH. Similar to BIRCH, our algorithm first performs a pre-clustering step by scanning the entire dataset and storing the dense regions of data records in terms of summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from its ability to handle mixed-type attributes, our algorithm differs from BIRCH in two ways: we add a procedure that enables the algorithm to automatically determine the appropriate number of clusters, and we introduce a new strategy for assigning cluster membership to noisy data. For data with mixed-type attributes, our experimental results confirm that the algorithm not only generates better quality clusters than the traditional k-means algorithm, but also exhibits good scalability properties and is able to identify the underlying number of clusters in the data correctly. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard of data mining model deployment.
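The abstract does not give the formula for the distance measure, so the following is only a minimal sketch of the idea under stated assumptions: continuous attributes are modeled as independent Gaussians, categorical attributes as independent multinomials, and each cluster is represented purely by summary statistics (count, sums, sums of squares, category counts). The names `ClusterSummary`, `log_likelihood`, `merge_distance`, and the `var_floor` parameter are hypothetical and chosen for illustration; they are not taken from the paper or from Clementine.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# A record is (continuous values, categorical values). Hypothetical layout.
Record = Tuple[Sequence[float], Sequence[str]]


@dataclass
class ClusterSummary:
    """Summary statistics for one cluster (illustrative structure only)."""
    n: int
    sums: List[float]          # per continuous attribute: sum of values
    sq_sums: List[float]       # per continuous attribute: sum of squares
    cat_counts: List[Counter]  # per categorical attribute: category -> count

    @classmethod
    def from_records(cls, records: Sequence[Record]) -> "ClusterSummary":
        cont0, cat0 = records[0]
        summary = cls(0, [0.0] * len(cont0), [0.0] * len(cont0),
                      [Counter() for _ in cat0])
        for cont, cat in records:
            summary.n += 1
            for i, x in enumerate(cont):
                summary.sums[i] += x
                summary.sq_sums[i] += x * x
            for i, value in enumerate(cat):
                summary.cat_counts[i][value] += 1
        return summary

    def merge(self, other: "ClusterSummary") -> "ClusterSummary":
        # Summary statistics are additive, so merging needs no raw records.
        return ClusterSummary(
            self.n + other.n,
            [a + b for a, b in zip(self.sums, other.sums)],
            [a + b for a, b in zip(self.sq_sums, other.sq_sums)],
            [a + b for a, b in zip(self.cat_counts, other.cat_counts)],
        )


def log_likelihood(c: ClusterSummary, var_floor: float = 1e-6) -> float:
    """Maximized log-likelihood of a cluster, up to terms linear in c.n.

    Assumption: Gaussian continuous attributes, multinomial categorical
    attributes, all independent. var_floor (hypothetical) keeps the log
    finite when a cluster has zero variance.
    """
    ll = 0.0
    for s, ss in zip(c.sums, c.sq_sums):
        mean = s / c.n
        var = max(ss / c.n - mean * mean, 0.0) + var_floor
        ll += -0.5 * c.n * math.log(var)
    for counts in c.cat_counts:
        for k in counts.values():
            ll += k * math.log(k / c.n)
    return ll


def merge_distance(a: ClusterSummary, b: ClusterSummary) -> float:
    """Decrease in total log-likelihood caused by merging a and b."""
    return log_likelihood(a) + log_likelihood(b) - log_likelihood(a.merge(b))
```

Because the dropped constants are linear in the cluster size, they cancel in the difference, and the result depends only on the summaries of the two clusters involved, which is the memory property the abstract describes.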


Bibliographic details

Source: ACM SIGKDD international conference on Knowledge discovery and data mining, 2001, 6 pages
Authors: Tom Chiu; DongPing Fang; John Chen; Yao Wang; Christopher Jeris
Original format: PDF
Chinese Library Classification: TP311.13
Keywords: number of clusters

Similar literature


1. Scalable algorithms for clustering large datasets with mixed type attributes [J]. He ZY, Xu XF, Deng SC. International Journal of Intelligent Systems, 2005, No. 10.
2. ROCKET: A Robust Parallel Algorithm for Clustering Large-Scale Transaction Databases [J]. Woong-Kee Loh, Yang-Sae Moon, Heejune Ahn. IEICE Transactions on Information and Systems, 2011, No. 10.
3. A robust and scalable clustering algorithm for mixed type attributes in large database environment [C]. Tom Chiu, DongPing Fang, John Chen, et al. ACM SIGKDD international conference on Knowledge discovery and data mining, 2001.
4. Scalable model-based clustering algorithms for large databases and their applications [D]. Jin, Huidong. 2002.
5. Combined Mapping of Multiple clUsteriNg ALgorithms (COMMUNAL): A Robust Method for Selection of Cluster Number K [O]. Timothy E. Sweeney, Albert C. Chen, Olivier Gevaert.
6. EM- and JMAP-ML Based Joint Estimation Algorithms for Robust Wireless Geolocation in Mixed LOS/NLOS Environments [O]. Yin, Feng; Fritsche, Carsten; Gustafsson, Fredrik. 2014.
