CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES

Abstract

Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means-based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data, the algorithm is identical to k-means. To assist interpretation of clusters, we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can help data miners understand and identify interesting clusters.
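
The abstract does not state the dissimilarity measure or the update rule, so the following is only a minimal sketch of a k-prototypes-style procedure under stated assumptions. It assumes the cost of assigning an object to a prototype is the squared Euclidean distance over numeric attributes plus a weight `gamma` times the number of mismatched categorical attributes, and that a prototype holds the mean of each numeric attribute and the mode of each categorical attribute. For brevity it recomputes prototypes once per pass (batch update) rather than after each individual reallocation as the abstract's "dynamically update" suggests. The names `dissimilarity`, `k_prototypes`, `gamma` and `init_idx` are illustrative and not taken from the paper.

```python
from collections import Counter


def dissimilarity(x_num, x_cat, p_num, p_cat, gamma):
    """Assumed mixed measure: squared Euclidean distance on numeric
    attributes plus gamma times the count of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    mismatches = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * mismatches


def k_prototypes(num, cat, init_idx, gamma=1.0, max_iter=20):
    """num: list of numeric tuples; cat: list of categorical tuples;
    init_idx: indices of the objects used as initial prototypes."""
    protos = [(list(num[i]), list(cat[i])) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        # Allocate every object to its nearest prototype.
        new_labels = [
            min(range(len(protos)),
                key=lambda j: dissimilarity(xn, xc, protos[j][0],
                                            protos[j][1], gamma))
            for xn, xc in zip(num, cat)
        ]
        if new_labels == labels:
            break  # no object changed cluster
        labels = new_labels
        # Batch update: numeric means and categorical modes per cluster.
        for j in range(len(protos)):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                continue  # keep the old prototype for an empty cluster
            means = [sum(num[i][d] for i in members) / len(members)
                     for d in range(len(num[0]))]
            modes = [Counter(cat[i][d] for i in members).most_common(1)[0][0]
                     for d in range(len(cat[0]))]
            protos[j] = (means, modes)
    return labels, protos


# Toy data: one numeric and one categorical attribute per object.
num = [(1.0,), (1.2,), (0.9,), (8.0,), (8.3,), (7.9,)]
cat = [("a",), ("a",), ("b",), ("c",), ("c",), ("c",)]
labels, protos = k_prototypes(num, cat, init_idx=[0, 3], gamma=0.5)
print(labels, protos)
```

With empty categorical tuples the mismatch term vanishes and the sketch reduces to ordinary batch k-means, which matches the abstract's remark about numeric data.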


Bibliographic details

Source: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1997, 14 pages
Author: Zhexue Huang
Original format: PDF
Chinese Library Classification: TP18-532

Similar literature

1. Genetic K-Means Clustering Algorithm for Mixed Numeric and Categorical Data Sets [J]. Dharmendra K Roy, Lokesh K Sharma. International Journal of Artificial Intelligence & Applications (IJAIA). 2010, No. 2
2. A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets [J]. Amir Ahmad, Lipika Dey. Pattern Recognition Letters. 2011, No. 7
3. Rough Sets Based Rule Generation from Data with Categorical and Numerical Values [J]. Hiroshi Sakai, Kazuhiro Koba, Michinori Nakata. Journal of Advanced Computational Intelligence and Intelligent Informatics. 2008, No. 5
4. A CSA-based clustering algorithm for large data sets with mixed numeric and categorical values [C]. Li Jie, Gao Xinbo, Jiao Li-Cheng. Intelligent Control and Automation, 2004. WCICA 2004. Fifth World Congress on. 2004
5. Automatic categorical data clustering and spatial data clustering by consecutive resolution refinement [D]. Foss, Andrew Philip Ogilvie. 2002
6. A new set-valued system identification approach to identifying rare genetic variants for ordered categorical phenotype [O]. Wenjian Bi, Guolian Kang, Yuehua Cui. 2014
7. A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets [O]. Xia Que, Siyuan Jiang, Jiaoyun Yang. 2021
