Journal: Technical Gazette

A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation


Abstract

Clustering is one of the central tasks in data mining, and partition-based clustering algorithms such as k-means are among the most popular solutions. However, with the growth of cloud computing and big data, large-scale datasets have become a major challenge for clustering: running the algorithm is too time-consuming, its parameters are difficult to optimize, and the quality of the resulting clusters is poor. To address this, we propose a common framework for partition-based clustering algorithms such as k-means and design its MapReduce implementation. Specifically, to handle the representation of a large-scale dataset, we employ a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means-based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large-scale datasets.
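The abstract does not spell out the procedure, so the following is only a minimal single-machine sketch of the general idea it describes: draw a sample, run k-means on the sample, and express each k-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points per cluster). All function names, the uniform random sampling, the squared Euclidean distance, and the stopping rule are assumptions made for illustration; they are not taken from the paper, and no Hadoop API is used.

```python
import random
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_phase(points, centroids):
    """Map: emit (cluster_id, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)

def reduce_phase(pairs, centroids):
    """Reduce: average the points emitted under each cluster id."""
    sums, counts = {}, defaultdict(int)
    for cid, (point, n) in pairs:
        if cid in sums:
            sums[cid] = [s + p for s, p in zip(sums[cid], point)]
        else:
            sums[cid] = list(point)
        counts[cid] += n
    # Assumption: a cluster that received no points keeps its previous centroid.
    return [tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
            for i in range(len(centroids))]

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means, each iteration written as one map step and one reduce step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    for _ in range(iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:  # stop once the centroids no longer move
            break
        centroids = updated
    return centroids

def sampled_kmeans(points, k, sample_size, seed=0):
    """Cluster a uniform random sample, then label every point in the full dataset."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

Here `points` is expected to be a list of equal-length numeric tuples. In an actual MapReduce job the map and reduce functions would run on separate partitions of the data; in this sketch they are ordinary Python functions so that the data flow stays easy to follow.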
