One-pass MapReduce-based clustering method for mixed large scale data

Ben HajKacem Mohamed Aymen; Ben Ncir Chiheb-Eddine; Essoussi Nadia

Abstract

Big data is often characterized by huge volume and mixed attribute types, namely numeric and categorical. K-prototypes has been adapted to the MapReduce framework and has thereby become a solution for clustering mixed large-scale data. However, k-prototypes requires computing the distance between every cluster center and every data point. Many of these distance computations are redundant, because data points usually stay in the same cluster after the first few iterations. Moreover, k-prototypes is not well suited to running within the MapReduce framework: its iterative nature cannot be modeled through MapReduce, since at each iteration the whole data set must be read from and written to disk, which results in a high number of input/output (I/O) operations. To deal with these issues, we propose a new one-pass accelerated MapReduce-based k-prototypes clustering method for mixed large-scale data. The proposed method reads and writes the data only once, which greatly reduces the I/O operations compared to the existing MapReduce implementation of k-prototypes. Furthermore, the proposed method relies on a pruning strategy that accelerates the clustering process by reducing the redundant distance computations between cluster centers and data points. Experiments performed on simulated and real data sets show that the proposed method is scalable and improves on the efficiency of existing k-prototypes methods.
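
To make the cost described in the abstract concrete, the following is a minimal Python sketch of the assignment step of standard k-prototypes, in which every point is compared against every center. It is not the authors' one-pass method and it contains no pruning; the names `mixed_distance`, `assign`, and the weight `gamma` are illustrative assumptions, and the dissimilarity used is the commonly cited k-prototypes form (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches).

```python
from typing import List, Sequence, Tuple

# A mixed record: (numeric attributes, categorical attributes).
# This representation is an assumption made for illustration.
Record = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(point: Record, center: Record, gamma: float = 1.0) -> float:
    """Squared Euclidean distance on the numeric part plus
    gamma times the number of categorical mismatches."""
    num_p, cat_p = point
    num_c, cat_c = center
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_c))
    mismatches = sum(1 for a, b in zip(cat_p, cat_c) if a != b)
    return numeric + gamma * mismatches


def assign(points: List[Record], centers: List[Record], gamma: float = 1.0) -> List[int]:
    """Assign each point to its nearest center.

    This computes len(points) * len(centers) distances on every call,
    which is the redundant work a pruning strategy aims to reduce.
    """
    return [
        min(range(len(centers)), key=lambda j: mixed_distance(p, centers[j], gamma))
        for p in points
    ]


# Toy data, invented for this example.
points = [
    ((1.0, 2.0), ("red", "small")),
    ((1.2, 1.9), ("red", "small")),
    ((8.0, 9.0), ("blue", "large")),
]
centers = [
    ((1.0, 2.0), ("red", "small")),
    ((8.0, 8.5), ("blue", "large")),
]

labels = assign(points, centers)
print(labels)
```

In an iterative MapReduce implementation, a step like `assign` would run once per iteration over the full data set, which is where the repeated disk reads and writes mentioned in the abstract come from.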

Bibliographic record

Source
Journal of Intelligent Information Systems, 2019, Issue 3, pp. 619-636 (18 pages)
Authors
Ben HajKacem Mohamed Aymen; Ben Ncir Chiheb-Eddine; Essoussi Nadia
Author affiliations

Univ Tunis, LARODEC, Inst Super Gest Tunis, 41 Ave Liberte, Le Bardo 2000, Tunisia;

Univ Tunis, LARODEC, Inst Super Gest Tunis, 41 Ave Liberte, Le Bardo 2000, Tunisia;

Univ Tunis, LARODEC, Inst Super Gest Tunis, 41 Ave Liberte, Le Bardo 2000, Tunisia;

Original format: PDF
Language: English
Keywords
K-prototypes; One-pass MapReduce; Large scale data; Mixed data; Pruning strategy;
