A MapReduce Data Balancing Method Based on an Incremental Partitioning Strategy

王卓; 陈群; 李战怀; 潘巍; 尤立

Abstract

MapReduce, with its simple programming model, is widely used to process large-scale, high-dimensional data sets in tasks such as log analysis, document clustering, and other forms of data analytics. The open-source system Hadoop implements the MapReduce model well, but it partitions data only once, using a Hash/Range partition function. As a result, the Reduce side often suffers from data skew when processing dense data. Although Hadoop lets users define their own partition function, skew is hard to avoid when the distribution of the input data is unknown. To address unbalanced partitioning, this paper proposes a strategy that assigns partitions to Reducers over multiple rounds. The method first produces, on the Map side, more fine-grained partitions than there are Reducers, and records the data volume of each fine-grained partition in real time while the Mappers run. The JobTracker then uses the global partition distribution to select a subset of the still-unassigned fine-grained partitions and assigns them to Reducers with a cost evaluation model and a heuristic assignment algorithm. After several rounds of selection and assignment, every fine-grained partition has been assigned to the Reduce side before the Reduce() function executes, which balances the total amount of data each Reducer receives. Finally, experiments on Zipf-distributed data sets and real data sets compare the approach with Closer, an existing method that splits partitions. The incremental partitioning strategy yields a more balanced data load.
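The multi-round assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the cost evaluation model or how partitions are selected in each round, so the sketch assumes a simple greedy rule (take the largest unassigned fine-grained partitions and place each on the currently least-loaded Reducer). The names `incremental_assign`, `reducer_loads`, `snapshots`, and `per_round` are hypothetical, as are the sample sizes.

```python
from typing import Dict, List


def reducer_loads(assignment: Dict[int, int],
                  sizes: Dict[int, int],
                  num_reducers: int) -> List[int]:
    """Total data volume per reducer under the given partition sizes."""
    loads = [0] * num_reducers
    for partition, reducer in assignment.items():
        loads[reducer] += sizes.get(partition, 0)
    return loads


def incremental_assign(snapshots: List[Dict[int, int]],
                       num_reducers: int,
                       per_round: int) -> Dict[int, int]:
    """Assign fine-grained partitions to reducers over several rounds.

    snapshots[i] maps partition id -> cumulative size observed by round i
    (a stand-in for the statistics the Mappers report while running).
    ASSUMPTION: greedy largest-first / least-loaded placement replaces
    the paper's cost model, which the abstract does not describe.
    """
    assignment: Dict[int, int] = {}
    for round_no, sizes in enumerate(snapshots):
        is_last = round_no == len(snapshots) - 1
        # Re-evaluate loads: already-assigned partitions may have grown.
        loads = reducer_loads(assignment, sizes, num_reducers)
        pending = sorted((p for p in sizes if p not in assignment),
                         key=lambda p: sizes[p], reverse=True)
        # Earlier rounds place only a few partitions; the last places all.
        chosen = pending if is_last else pending[:per_round]
        for p in chosen:
            target = min(range(num_reducers), key=loads.__getitem__)
            assignment[p] = target
            loads[target] += sizes[p]
    return assignment


# Hypothetical statistics: 6 fine-grained partitions, 2 reducers, 2 rounds.
snapshots = [
    {0: 50, 1: 10, 2: 10, 3: 5, 4: 5, 5: 0},
    {0: 100, 1: 20, 2: 20, 3: 10, 4: 10, 5: 40},
]
final_sizes = snapshots[-1]

assignment = incremental_assign(snapshots, num_reducers=2, per_round=2)
print(reducer_loads(assignment, final_sizes, 2))   # [100, 100]

# One-shot modulo partitioning of the same partitions, for comparison.
one_shot = {p: p % 2 for p in final_sizes}
print(reducer_loads(one_shot, final_sizes, 2))     # [130, 70]
```

On this toy input the multi-round assignment ends with equal Reducer loads, while the one-shot modulo assignment leaves one Reducer with nearly twice the data of the other. The result depends on the assumed greedy rule and the made-up sizes, and says nothing about the paper's measured performance.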

Bibliographic details

Source
Chinese Journal of Computers (《计算机学报》) | 2016, Issue 1 | pp. 19-35 | 17 pages
Authors
王卓; 陈群; 李战怀; 潘巍; 尤立
Author affiliation

School of Computer Science, Northwestern Polytechnical University, Xi'an 710072
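Original format: PDF
Language of text: Chinese
Chinese Library Classification: Programming, software engineering
Keywords
incremental allocation; fine-grained partition; data skew; balanced partitioning; MapReduce; big data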
