Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies

Chih-Fong Tsai; Wei-Chao Lin; Shih-Wen Ke

Abstract

Mining with big data, or big data mining, has become an active research area. With current methodologies and data mining software tools, it is very difficult for a single personal computer to deal efficiently with very large datasets. Parallel and cloud computing platforms are considered a better solution for big data mining. The concept of parallel computing is based on dividing a large problem into smaller ones, each of which is carried out by a single processor. These processes are performed concurrently in a distributed and parallel manner. Two methodologies are commonly used to tackle the big data problem. The first is the distributed procedure based on the data parallelism paradigm, where a given big dataset is manually divided into n subsets and n algorithms are executed, one for each of the n subsets. The final result is obtained by combining the outputs produced by the n algorithms. The second is the MapReduce based procedure under the cloud computing platform. This procedure is composed of the map and reduce processes, in which the former performs filtering and sorting and the latter performs a summary operation in order to produce the final result. In this paper, we aim to compare the performance differences between the distributed and MapReduce methodologies over large scale datasets in terms of mining accuracy and efficiency. The experiments are based on four large scale datasets, which are used for data classification problems. The results show that the classification performance of the MapReduce based procedure is very stable no matter how many computer nodes are used, and that it is better than the baseline single machine and distributed procedures except on the class imbalance dataset. In addition, the MapReduce procedure requires the least computational cost to process these big datasets.
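
The two procedures described in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation: the toy dataset, the nearest-centroid classifier, the majority-vote combination, and every function name are assumptions chosen for illustration, and the "nodes" are simulated sequentially rather than run on a real cluster.

```python
from collections import Counter, defaultdict

# Toy 1-D dataset of (feature, label) pairs; the values are made up.
DATA = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (1.1, "a"),
        (3.0, "b"), (3.2, "b"), (2.9, "b"), (3.1, "b")]


def train_centroids(records):
    """Assumed stand-in 'algorithm': mean feature value per class."""
    sums, counts = defaultdict(float), Counter()
    for x, label in records:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# --- Distributed (data-parallel) procedure ---------------------------------
def distributed_predict(data, n, x):
    """Split data into n subsets, train one model per subset, combine by vote."""
    subsets = [data[i::n] for i in range(n)]         # manual division into n subsets
    models = [train_centroids(s) for s in subsets]   # each could run on its own node
    votes = Counter(predict(m, x) for m in models)   # combine the n outputs
    return votes.most_common(1)[0][0]


# --- MapReduce-style procedure ----------------------------------------------
def map_phase(record):
    """Map: emit a (key, value) pair keyed by class label."""
    x, label = record
    return label, (x, 1)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(label, values):
    """Reduce: summarize all values for one key into a class centroid."""
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return label, total / count


def mapreduce_train(data):
    groups = shuffle(map(map_phase, data))
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))


if __name__ == "__main__":
    print(distributed_predict(DATA, n=2, x=1.05))
    print(predict(mapreduce_train(DATA), 3.0))
```

In this sketch the distributed procedure trains n independent models and merges their predictions, while the MapReduce procedure trains one model whose statistics are aggregated across all records. With an even n the vote can tie, in which case `Counter.most_common` returns the label that was counted first.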

Bibliographic details

Source
The Journal of Systems and Software | 2016, Issue 12 | pp. 83-92 | 10 pages
Authors
Chih-Fong Tsai; Wei-Chao Lin; Shih-Wen Ke
Author affiliations

Department of Information Management, National Central University, Taiwan;

Department of Computer Science and Information Engineering, Asia University, Taiwan;

Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan;

Original format: PDF
Text language: English
Keywords
Big data; Data mining; Parallel computing; Distributed; Cloud computing; MapReduce;

