
A robust and scalable clustering algorithm for mixed type attributes in large database environment

Abstract

Clustering is widely used in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to datasets that contain either continuous or categorical attributes, but not both. However, datasets with mixed attribute types are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering of data with both continuous and categorical attributes. The measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculating this measure is memory efficient because it depends only on the pair of clusters being merged and not on any of the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm that uses our distance measure within the BIRCH framework. Like BIRCH, our algorithm first performs a pre-clustering step, scanning the entire dataset and storing the dense regions of data records as summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from its ability to handle mixed-type attributes, our algorithm differs from BIRCH in two ways: it adds a procedure that automatically determines the appropriate number of clusters, and it uses a new strategy for assigning cluster membership to noisy data. For data with mixed-type attributes, our experimental results confirm that the algorithm not only generates better-quality clusters than the traditional k-means algorithms, but also scales well and correctly identifies the underlying number of clusters in the data. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard for data mining model deployment.
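
The merge-based distance described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula, so the model here is an assumption: continuous attributes are treated as independent Gaussians and categorical attributes as independent multinomials, both fitted by maximum likelihood, with a small variance floor added to avoid taking the logarithm of zero. The names `ClusterSummary`, `log_likelihood`, `merge_distance`, and `var_floor` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClusterSummary:
    """Additive summary statistics for one non-empty cluster (assumed layout)."""
    n: int = 0
    sums: list = field(default_factory=list)     # sum of each continuous attribute
    sq_sums: list = field(default_factory=list)  # sum of squares of each continuous attribute
    counts: list = field(default_factory=list)   # one Counter per categorical attribute

    @classmethod
    def from_records(cls, records):
        """Build a summary from (continuous_values, categorical_values) tuples."""
        summary = cls()
        for cont, cat in records:
            if summary.n == 0:
                summary.sums = [0.0] * len(cont)
                summary.sq_sums = [0.0] * len(cont)
                summary.counts = [Counter() for _ in cat]
            summary.n += 1
            for k, x in enumerate(cont):
                summary.sums[k] += x
                summary.sq_sums[k] += x * x
            for k, c in enumerate(cat):
                summary.counts[k][c] += 1
        return summary

    def merge(self, other):
        """Merging two non-empty clusters only requires adding their statistics."""
        return ClusterSummary(
            n=self.n + other.n,
            sums=[a + b for a, b in zip(self.sums, other.sums)],
            sq_sums=[a + b for a, b in zip(self.sq_sums, other.sq_sums)],
            counts=[a + b for a, b in zip(self.counts, other.counts)],
        )


def log_likelihood(s, var_floor=1e-6):
    """Log-likelihood of a non-empty cluster, up to an additive constant
    proportional to the cluster size (it cancels in merge_distance).
    Assumed model: independent Gaussian and multinomial attributes."""
    per_record = 0.0
    for total, sq_total in zip(s.sums, s.sq_sums):
        mean = total / s.n
        variance = max(sq_total / s.n - mean * mean, 0.0)
        per_record += 0.5 * math.log(variance + var_floor)
    for counter in s.counts:
        # Entropy of the observed category frequencies.
        per_record -= sum((c / s.n) * math.log(c / s.n) for c in counter.values())
    return -s.n * per_record


def merge_distance(a, b, var_floor=1e-6):
    """Decrease in log-likelihood caused by merging clusters a and b."""
    merged = a.merge(b)
    return (log_likelihood(a, var_floor) + log_likelihood(b, var_floor)
            - log_likelihood(merged, var_floor))
```

Because the summaries are additive, `merge_distance` reads only the statistics of the two clusters being merged, which matches the memory-efficiency argument in the abstract. For instance, two clusters with similar values and the same category should yield a smaller distance than two clusters that differ in both.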

Bibliographic record

Source
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 263-268 (6 pages)

Authors
Tom Chiu; DongPing Fang; John Chen; Yao Wang; Christopher Jeris

Original format: PDF

Chinese Library Classification: Computing technology, computer technology

Keywords
number of clusters
