Journal: Applied Soft Computing

NOCEA: A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases

Abstract

Clustering is a descriptive data mining task that aims to partition data into homogeneous groups. This paper presents a novel evolutionary algorithm (NOCEA) that efficiently and effectively clusters massive numerical databases. NOCEA evolves variable-length individuals consisting of disjoint, axis-aligned hyper-rectangular rules with homogeneous data distribution. The antecedent part of each rule includes an interval-like condition for each dimension. A novel quantisation algorithm imposes a regular multi-dimensional grid structure onto the data space to reduce the search combinations. Because of quantisation, the boundaries of the intervals are encoded as integer values. The evolutionary search is guided by a simple data coverage maximisation function. The enormous data space is effectively explored by task-specific recombination and mutation operators that produce candidate solutions with no overlapping rules. A parsimony generalisation operator shortens the discovered knowledge by replacing adjacent rules with more generic ones. NOCEA employs a special homogeneity operator that enforces quasi-uniform data distribution in the space enclosed by the candidate rules. After convergence, the discovered knowledge undergoes simplification to perform subspace clustering and to assemble the clusters. Results using real-world datasets are included to show that NOCEA has several attractive properties for clustering, including: (a) comprehensible output in the form of disjoint and homogeneous rules, (b) the ability to discover clusters of arbitrary shape, density, size, and data coverage, (c) the ability to perform effective subspace clustering, (d) near-linear scalability with the database size, data dimensionality, and cluster dimensionality, and (e) substantial potential for task parallelism (speedup of 13.8 on 16 processors). A real-world example is a detailed study of the seismicity along the African-Eurasian-Arabian plate boundaries.
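The rule encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the quantisation algorithm, so an equal-width grid is assumed here, and every name (`quantise`, `covers`, `overlap`, `is_disjoint`, `coverage`) is hypothetical. The sketch only shows how integer interval bounds, the disjointness constraint, and a coverage-style fitness might fit together.

```python
from typing import List, Sequence, Tuple

# A rule is one inclusive [low, high] interval of grid-cell indices per dimension.
Rule = List[Tuple[int, int]]
Cell = Tuple[int, ...]


def quantise(point: Sequence[float], lows: Sequence[float],
             highs: Sequence[float], bins: Sequence[int]) -> Cell:
    """Map a real-valued point to integer cell indices.

    Assumption: equal-width bins per dimension; the paper's own
    quantisation algorithm is not described in the abstract.
    """
    cell = []
    for x, lo, hi, b in zip(point, lows, highs, bins):
        idx = int((x - lo) / (hi - lo) * b)
        cell.append(min(max(idx, 0), b - 1))  # clamp so x == hi lands in the last bin
    return tuple(cell)


def covers(rule: Rule, cell: Cell) -> bool:
    """True if the cell lies inside the rule's interval in every dimension."""
    return all(lo <= c <= hi for (lo, hi), c in zip(rule, cell))


def overlap(a: Rule, b: Rule) -> bool:
    """Axis-aligned boxes overlap only if their intervals intersect in every dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


def is_disjoint(rules: Sequence[Rule]) -> bool:
    """Check the no-overlapping-rules constraint on a candidate solution."""
    return not any(overlap(rules[i], rules[j])
                   for i in range(len(rules))
                   for j in range(i + 1, len(rules)))


def coverage(rules: Sequence[Rule], cells: Sequence[Cell]) -> float:
    """Fraction of data points covered by at least one rule (a stand-in fitness)."""
    if not cells:
        return 0.0
    covered = sum(1 for cell in cells if any(covers(r, cell) for r in rules))
    return covered / len(cells)


if __name__ == "__main__":
    lows, highs, bins = (0.0, 0.0), (10.0, 10.0), (10, 10)
    points = [(0.5, 0.5), (1.5, 2.5), (10.0, 10.0), (3.5, 3.5)]
    cells = [quantise(p, lows, highs, bins) for p in points]
    candidate = [[(0, 1), (0, 2)], [(5, 9), (5, 9)]]
    print(is_disjoint(candidate), coverage(candidate, cells))
```

Working on integer cell indices rather than raw coordinates keeps the overlap test to a handful of integer comparisons per dimension, which is one plausible reason for encoding interval boundaries as integers.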
