A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation

Jin Ran; Kou Chunhai; Liu Ruijuan; Guo Tao

Abstract

Clustering is one of the significant tasks in data mining, and partition-based clustering algorithms such as k-means are among the most popular solutions. However, with the growth of cloud computing and big data, large-scale datasets have become a major challenge for clustering: running the clustering algorithm is too time-consuming, its parameters are difficult to optimize, and the quality of the resulting clusters is poor. To address these problems, we propose a common framework for partition-based clustering algorithms such as k-means and design its MapReduce implementation. Specifically, to handle the representation of large-scale datasets, we employ a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means-based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large-scale datasets.
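The abstract does not give the details of the framework, so the following is only a minimal single-machine sketch of the general idea it names: fit k-means on a sample, phrase each iteration as a map step and a reduce step, then label the full dataset. Uniform random sampling, squared Euclidean distance, and all function names (`map_assign`, `reduce_mean`, `sampled_kmeans`) are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def map_assign(point, centroids):
    """Map step: emit (cluster index, point) for one input record."""
    return nearest(point, centroids), point


def reduce_mean(points):
    """Reduce step: average all points that share one cluster key."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def kmeans_round(points, centroids):
    """One k-means iteration expressed as map, group-by-key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # An empty cluster keeps its previous centroid.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def sampled_kmeans(points, k, sample_size, rounds=20, seed=0):
    """Fit centroids on a uniform random sample, then label every point."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = rng.sample(sample, k)  # assumed initialisation: k sample points
    for _ in range(rounds):
        updated = kmeans_round(sample, centroids)
        if updated == centroids:  # assignments stopped changing
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

In a real MapReduce job the grouping in `kmeans_round` would be done by the framework's shuffle phase, and the final labelling pass over the full dataset would be a map-only job; here both are simulated in memory.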

Bibliographic details

Source: Technical Gazette, 2016, Issue 1, 9 pages
Authors: Jin Ran; Kou Chunhai; Liu Ruijuan; Guo Tao
Original format: PDF
CLC classification: General industrial technology

Similar literature

1. Partition based clustering of large datasets using MapReduce framework: An analysis of recent themes and directions [J]. Tanvir Habib Sardar, Zahid Ansari. Future Computing and Informatics Journal, 2018, Issue 2.
2. CloudNMF: A MapReduce Implementation of Nonnegative Matrix Factorization for Large-scale Biological Datasets [J]. Ruiqi Liao, Yifan Zhang, Jihong Guan. Genomics, Proteomics & Bioinformatics, 2014, Issue 1.
3. CloudNMF: A MapReduce Implementation of Nonnegative Matrix Factorization for Large-scale Biological Datasets [J]. Ruiqi Liao, Yifan Zhang, Jihong Guan. Genomics, Proteomics & Bioinformatics (English edition), 2014, Issue 1.
4. Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce [C]. Estrada Trilce, Zhang Boyu, Taufer Michela. The 14th IEEE International Conference on High Performance Computing and Communication; The 9th IEEE International Conference on Embedded Software and Systems, 2012.
5. Investigating MapReduce framework extensions for efficient processing of geographically scattered datasets [D]. Gadre, Hrishikesh. 2011.
6. CloudNMF: A MapReduce Implementation of Nonnegative Matrix Factorization for Large-scale Biological Datasets [O]. Ruiqi Liao, Yifan Zhang, Jihong Guan. 2014.
7. A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation [O]. 2016.