Handling data skew in joins based on cluster cost partitioning for MapReduce

Abstract

Data skew in parallel joins results in poor load balancing, which can lead to significantly varying execution times for the reducers in MapReduce. The performance of the join operation is severely degraded in the presence of heavy skew in the datasets to be joined. Previous work mainly focuses on either input or output load imbalance among reducers, but not both, which is ineffective for load balancing. In this paper, we present a new data skew handling method based on Cluster Cost Partitioning (CCP) for optimizing parallel joins in MapReduce. A new cost model that considers the properties of both input and output is defined to estimate the cost of the parallel join. CCP employs clusters instead of join keys from the input relations to create the join matrix. Using the cost model, CCP identifies and splits heavy cells in the cluster join matrix. CCP then assigns a set of non-heavy cells to reducers to balance the join load. For different applications, the input and output weight values in the cost model can be dynamically adjusted to depict the join costs more precisely. The experimental results demonstrate that CCP achieves more accurate load balancing among reducers.
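
The abstract names the steps of CCP (a weighted input/output cost model, splitting of heavy cells, assignment of the remaining cells to reducers) but does not give the formulas. The following is only a minimal sketch of that general idea, not the paper's algorithm. Everything in it is an assumption made for illustration: the linear cost `w_in * input + w_out * output`, the worst-case output estimate `r * s`, the halving rule in `split_heavy`, the `limit` threshold, and the greedy least-loaded assignment in `assign`.

```python
import heapq
from typing import List, Tuple

# One cell of a (hypothetical) cluster join matrix:
# (number of tuples from relation R, number of tuples from relation S).
Cell = Tuple[int, int]


def cost(cell: Cell, w_in: float = 1.0, w_out: float = 1.0) -> float:
    """Assumed cost model: weighted sum of input size and estimated output size."""
    r, s = cell
    # Output is estimated as r * s, i.e. every R tuple joins every S tuple.
    return w_in * (r + s) + w_out * (r * s)


def split_heavy(cell: Cell, limit: float,
                w_in: float = 1.0, w_out: float = 1.0) -> List[Cell]:
    """Recursively halve the larger side of a cell until its cost is <= limit."""
    r, s = cell
    if cost(cell, w_in, w_out) <= limit or max(r, s) <= 1:
        return [cell]
    if r >= s:
        halves = [(r // 2, s), (r - r // 2, s)]
    else:
        halves = [(r, s // 2), (r, s - s // 2)]
    return [piece for h in halves for piece in split_heavy(h, limit, w_in, w_out)]


def assign(cells: List[Cell], n_reducers: int,
           w_in: float = 1.0, w_out: float = 1.0) -> List[List[Cell]]:
    """Greedy assignment: the most expensive cell goes to the least-loaded reducer."""
    heap = [(0.0, i) for i in range(n_reducers)]  # (current load, reducer index)
    heapq.heapify(heap)
    plan: List[List[Cell]] = [[] for _ in range(n_reducers)]
    for cell in sorted(cells, key=lambda c: cost(c, w_in, w_out), reverse=True):
        load, i = heapq.heappop(heap)
        plan[i].append(cell)
        heapq.heappush(heap, (load + cost(cell, w_in, w_out), i))
    return plan
```

In this sketch, raising `w_out` relative to `w_in` makes output-heavy cells look more expensive, which loosely mirrors the abstract's remark that the input and output weights can be adjusted per application. Splitting one side of a cell keeps the estimated output unchanged but replicates the other side's input across the pieces, a trade-off the assumed cost function makes visible.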

Bibliographic information

  • Source
    Multiagent and Grid Systems | 2018, Issue 1 | pp. 103-123 | 21 pages
  • Author affiliations

    Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, Sichuan, China, University of Chinese Academy of Sciences, Beijing, China;

    Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, Sichuan, China;

    Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, Sichuan, China, University of Chinese Academy of Sciences, Beijing, China;

    Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang, Guizhou, China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Data skew; load balance; join algorithm; cluster cost partitioning; MapReduce;
