Source: Journal of Supercomputing

An improved partitioning mechanism for optimizing massive data analysis using MapReduce


Abstract

In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it has to divide the workload among the computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems inherent in data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the partitioning mechanism analyzes the sample. This paper proposes a partitioning algorithm that improves load balancing and reduces memory consumption by means of an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
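
To make the idea of sample-based partitioning concrete, the following is a minimal sketch of generic range partitioning driven by a sample. It is not the algorithm proposed in the paper, nor TeraSort's actual implementation. The function names (`choose_split_points`, `partition_for`, `partition_sizes`), the toy skewed key distribution, and the sample size are all assumptions made for illustration.

```python
import bisect
import random
from collections import Counter


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys at evenly spaced
    quantiles of the sorted sample (hypothetical helper)."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the partition (reducer) that receives key.
    Keys equal to a boundary go to the partition on its right."""
    return bisect.bisect_right(split_points, key)


def partition_sizes(keys, split_points):
    """Count how many keys land in each partition."""
    counts = Counter(partition_for(k, split_points) for k in keys)
    return [counts.get(p, 0) for p in range(len(split_points) + 1)]


# Toy skewed input (an assumption): most keys are small, a few are large.
rng = random.Random(0)
keys = [int(rng.paretovariate(1.5) * 100) for _ in range(20000)]

# A small random sample stands in for the full data set.
sample = rng.sample(keys, 1000)
split_points = choose_split_points(sample, 4)
sizes = partition_sizes(keys, split_points)
print("split points:", split_points)
print("partition sizes:", sizes)
```

In this sketch, a larger or more representative sample generally yields split points closer to the true quantiles of the data, and therefore more even partition sizes. Note that a single heavily repeated key cannot be divided by range boundaries alone, so it still lands entirely in one partition.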
