Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES


Abstract

Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. Standard hierarchical clustering methods do not solve this problem because they are computationally inefficient. The k-means based methods are promising because they process large data sets efficiently, but their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data, the algorithm is identical to k-means. To assist interpretation of clusters, we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can help data miners understand and identify interesting clusters.
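
The clustering loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The abstract does not give the dissimilarity measure or the update rule, so the sketch assumes squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of mismatched categorical attributes, and it recomputes each prototype in batch as the per-attribute mean (numeric) and mode (categorical). The names `dissimilarity`, `k_prototypes`, and `gamma` are hypothetical, and the batch update may differ from the dynamic update the abstract mentions.

```python
import random
from collections import Counter


def dissimilarity(x_num, x_cat, p_num, p_cat, gamma):
    """Assumed measure: squared Euclidean distance on the numeric part
    plus gamma times the count of mismatched categorical attributes."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    categorical = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * categorical


def k_prototypes(num, cat, k, gamma=1.0, max_iter=100, seed=0):
    """Cluster objects given as parallel lists of numeric and categorical tuples.

    Returns (labels, prototypes), where each prototype is a pair
    (numeric means, categorical modes).
    """
    rng = random.Random(seed)
    n = len(num)
    # Initialise the prototypes from k distinct, randomly chosen objects.
    protos = [(list(num[i]), list(cat[i])) for i in rng.sample(range(n), k)]
    labels = [-1] * n
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest prototype.
        new_labels = [
            min(
                range(k),
                key=lambda j: dissimilarity(
                    num[i], cat[i], protos[j][0], protos[j][1], gamma
                ),
            )
            for i in range(n)
        ]
        if new_labels == labels:
            break  # no object changed cluster, so the partition is stable
        labels = new_labels
        # Update step: mean for numeric attributes, mode for categorical ones.
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                continue  # keep the old prototype for an empty cluster
            means = [
                sum(num[i][d] for i in members) / len(members)
                for d in range(len(num[0]))
            ]
            modes = [
                Counter(cat[i][d] for i in members).most_common(1)[0][0]
                for d in range(len(cat[0]))
            ]
            protos[j] = (means, modes)
    return labels, protos
```

If every object is passed an empty categorical tuple, the categorical term vanishes and the loop reduces to a batch form of k-means, which matches the abstract's remark about numeric data. A larger `gamma` makes categorical mismatches count for more relative to numeric distance.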
