Source: The Journal of Systems and Software

Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies



Abstract

Mining with big data, or big data mining, has become an active research area. Current methodologies and data mining software tools running on a single personal computer struggle to deal efficiently with very large datasets. Parallel and cloud computing platforms are considered a better solution for big data mining. Parallel computing is based on dividing a large problem into smaller ones, each of which is carried out by a single processor, with the processors working concurrently in a distributed and parallel manner. There are two common methodologies used to tackle the big data problem. The first is the distributed procedure based on the data parallelism paradigm, in which a given big dataset is manually divided into n subsets and n algorithms are executed, one on each subset. The final result is obtained by combining the outputs produced by the n algorithms. The second is the MapReduce based procedure under the cloud computing platform. This procedure is composed of the map and reduce processes, in which the former performs filtering and sorting and the latter performs a summary operation in order to produce the final result. In this paper, we aim to compare the performance differences between the distributed and MapReduce methodologies over large scale datasets in terms of mining accuracy and efficiency. The experiments are based on four large scale datasets, which are used for data classification problems. The results show that the classification performance of the MapReduce based procedure is very stable no matter how many computer nodes are used, and is better than that of the baseline single machine and distributed procedures except on the class imbalance dataset. In addition, the MapReduce procedure requires the least computational cost to process these big datasets.
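The two procedures described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy dataset, the nearest-centroid base learner, the majority-vote combination rule, and all function names are hypothetical, and both procedures run sequentially in one process rather than on real computer nodes.

```python
from collections import Counter, defaultdict

# Toy 1-D dataset of (feature, label) pairs; values are made up for illustration.
DATA = [(0.9, "a"), (1.1, "a"), (1.0, "a"), (1.2, "a"),
        (4.8, "b"), (5.1, "b"), (5.0, "b"), (5.3, "b")]


def split(data, n):
    """Divide the dataset into n subsets (round-robin slicing)."""
    return [data[i::n] for i in range(n)]


def train_centroids(subset):
    """Hypothetical base learner: mean feature value per class."""
    sums, counts = defaultdict(float), Counter()
    for x, label in subset:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Assign x to the class whose centroid is nearest."""
    return min(model, key=lambda label: abs(model[label] - x))


def distributed_predict(data, n, x):
    """Distributed procedure: one model per subset, outputs combined by
    majority vote (the combination rule is an assumption)."""
    models = [train_centroids(s) for s in split(data, n)]
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


def map_phase(record):
    """Map: emit a (key, value) pair keyed by class label."""
    x, label = record
    return label, (x, 1)


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values of one key into a class centroid."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def mapreduce_centroids(data):
    """MapReduce procedure: map every record, shuffle by key, reduce per key."""
    pairs = [map_phase(r) for r in data]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(distributed_predict(DATA, 2, 1.5))
    print(predict(mapreduce_centroids(DATA), 4.0))
```

In this sketch the distributed procedure trains n independent models and merges their predictions, whereas the MapReduce procedure builds a single model from keyed intermediate pairs. That structural difference is one plausible way to picture the contrast the abstract draws, though the paper's actual classifiers and combination method may differ.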
