Handling data skew in join algorithms using MapReduce

Myung Jaeseok; Shim Junho; Yeon Jongheum; Lee Sang-goo


Abstract

One of the major obstacles hindering effective join processing on MapReduce is data skew. Since MapReduce's basic hash-based partitioning method cannot solve the problem properly, two alternatives have been proposed: range-based and randomized methods. However, some drawbacks remain: the range-based method does not handle join product skew, and the randomized method performs worse than the basic hash-based partitioning when input relations are not skewed. In this paper, we present a new skew handling method, called multi-dimensional range partitioning (MDRP). The proposed method overcomes the limitations of traditional algorithms in two ways: 1) the number of output records expected at each machine is considered, which leads to better handling of join product skew, and 2) a small number of input records are sampled before the actual join begins so that an efficient execution plan considering the degree of data skew can be created. As a result, in a scalar skew experiment, the proposed join algorithm is about 6.76 times faster than the range-based algorithm when join product skew exists and about 5.14 times faster than the randomized algorithm when input relations are not skewed. Moreover, through the worst-case analysis, we show that the input and the output imbalances are less than or equal to 2. The proposed algorithm does not require any modification to the original MapReduce environment and is applicable to complex join operations such as theta joins and multi-way joins. (C) 2016 Elsevier Ltd. All rights reserved.
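
To make the notion of join product skew concrete, the following is a minimal Python sketch. It is not the MDRP algorithm from the paper; it only counts, for an equi-join in which each key is assigned to exactly one reducer, how many input records a reducer receives and how many output records it must emit. The function name `reducer_loads`, the modulo partitioner, and the toy relations are assumptions made for illustration.

```python
from collections import Counter


def reducer_loads(r_keys, s_keys, num_reducers, partition):
    """Per-reducer input and output record counts for an equi-join of R and S."""
    r_count, s_count = Counter(r_keys), Counter(s_keys)
    inputs = [0] * num_reducers
    outputs = [0] * num_reducers
    for key in set(r_count) | set(s_count):
        p = partition(key)
        # Input load: every record with this key is shipped to reducer p.
        inputs[p] += r_count[key] + s_count[key]
        # Output load: an equi-join emits one record per matching (r, s) pair.
        outputs[p] += r_count[key] * s_count[key]
    return inputs, outputs


# Toy relations (assumed data): each key contributes 100 input records in total.
r_keys = [0] * 50 + [1] * 99 + [2] * 1 + [3] * 100
s_keys = [0] * 50 + [1] * 1 + [2] * 99

inputs, outputs = reducer_loads(r_keys, s_keys, 4, lambda k: k % 4)
print(inputs)   # [100, 100, 100, 100]
print(outputs)  # [2500, 99, 99, 0]
```

In this toy case every reducer receives the same number of input records, yet one reducer produces almost all of the join output. A partitioner that balances input alone cannot see this, which is why the abstract stresses considering the expected number of output records at each machine.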


Bibliographic details

Source
Expert Systems with Applications, 2016, Issue 6, pp. 286-299 (14 pages)
Authors
Myung Jaeseok; Shim Junho; Yeon Jongheum; Lee Sang-goo
Author affiliations

Samsung Elect Co Ltd, Corp Design Ctr, Seoul, South Korea|Seoul Natl Univ, Seoul 151, South Korea;

Sookmyung Womens Univ, Div Comp Sci, Seoul, South Korea;

Seoul Natl Univ, Sch Comp Sci & Engn, Seoul 151, South Korea;

Seoul Natl Univ, Sch Comp Sci & Engn, Seoul 151, South Korea;

Original format: PDF
Language: English
Keywords
MapReduce; Join algorithm; Skew handling; Multi-dimensional range partitioning;


