Journal: Expert Systems with Applications

Handling data skew in join algorithms using MapReduce


Abstract

One of the major obstacles hindering effective join processing on MapReduce is data skew. Since MapReduce's basic hash-based partitioning method cannot solve the problem properly, two alternatives have been proposed: range-based and randomized methods. However, both still have drawbacks: the range-based method does not handle join product skew, and the randomized method performs worse than the basic hash-based partitioning when input relations are not skewed. In this paper, we present a new skew handling method, called multi-dimensional range partitioning (MDRP). The proposed method overcomes the limitations of traditional algorithms in two ways: 1) the number of output records expected at each machine is considered, which leads to better handling of join product skew, and 2) a small number of input records are sampled before the actual join begins so that an efficient execution plan considering the degree of data skew can be created. As a result, in a scalar skew experiment, the proposed join algorithm is about 6.76 times faster than the range-based algorithm when join product skew exists and about 5.14 times faster than the randomized algorithm when input relations are not skewed. Moreover, through the worst-case analysis, we show that the input and the output imbalances are less than or equal to 2. The proposed algorithm does not require any modification to the original MapReduce environment and is applicable to complex join operations such as theta joins and multi-way joins. (C) 2016 Elsevier Ltd. All rights reserved.
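To make the distinction between input skew and join product skew concrete, the following is a minimal Python sketch of per-reducer load under plain hash partitioning for an equi-join. It is not the MDRP algorithm from the paper. The function names `reducer_loads` and `imbalance`, the toy relations, and the choice of "maximum load divided by mean load" as the imbalance measure are all assumptions made for illustration; the abstract does not state how the paper defines imbalance.

```python
from collections import Counter


def reducer_loads(r_keys, s_keys, num_reducers):
    """Per-reducer input and output record counts for a hash-partitioned equi-join.

    Hypothetical helper: integer keys are assigned with `key % num_reducers`
    as a stand-in for MapReduce's default hash partitioning.
    """
    r_counts = Counter(r_keys)
    s_counts = Counter(s_keys)
    input_load = [0] * num_reducers
    output_load = [0] * num_reducers
    for key in set(r_counts) | set(s_counts):
        reducer = key % num_reducers
        # Input: every record carrying this key is shipped to the same reducer.
        input_load[reducer] += r_counts[key] + s_counts[key]
        # Output: an equi-join emits |R_key| * |S_key| records for this key.
        output_load[reducer] += r_counts[key] * s_counts[key]
    return input_load, output_load


def imbalance(loads):
    """Max load over mean load (assumed measure); 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Toy relations: key 0 appears 100 times in each, keys 1..100 appear once.
r_keys = [0] * 100 + list(range(1, 101))
s_keys = [0] * 100 + list(range(1, 101))

input_load, output_load = reducer_loads(r_keys, s_keys, num_reducers=4)
print("input load per reducer: ", input_load, "imbalance:", imbalance(input_load))
print("output load per reducer:", output_load, "imbalance:", imbalance(output_load))
```

In this toy case the reducer that receives the hot key holds 250 of 400 input records but produces 10025 of 10100 output records, so balancing input sizes alone would still leave the join output concentrated on one machine. This is the kind of situation the abstract refers to as join product skew.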
