
Handling Data Skew for Aggregation in Spark SQL Using Task Stealing



Abstract

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partitioning algorithm is difficult and requires users to have adequate prior knowledge of the data, so data skew is common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. The optimization aims to avoid that additional overhead and to improve performance when data skew occurs. The core idea is task stealing. Based on the relative size of the data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, a stealing task can actively steal and process data from segment tasks after it has processed its own data. The optimization improves performance by 16% to 67% across different data sizes and distributions. Experiments show that the overhead involved is negligible.
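The task-stealing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' Spark SQL implementation; it is a toy, thread-based simulation under stated assumptions. The names `run_stage`, `aggregate`, and `segment_size` are hypothetical, the rule that a partition is "large" when it exceeds the mean partition size is an assumption, and all segments are placed in one shared pool for simplicity. The aggregation shown is a sum by key.

```python
import threading
from collections import Counter, deque


def aggregate(rows, acc):
    """Fold (key, value) rows into a partial sum-by-key aggregate."""
    for key, value in rows:
        acc[key] += value


def run_stage(partitions, segment_size=2):
    """Sum values by key; tasks on small partitions steal segments of large ones."""
    # Assumption: a partition is "large" when it exceeds the mean partition size.
    mean_size = sum(len(p) for p in partitions) / len(partitions)
    segments = deque()        # shared pool of stealable segments (a simplification)
    lock = threading.Lock()   # guards the shared pool
    partials = []             # one partial aggregate per task

    def next_segment():
        with lock:
            return segments.popleft() if segments else None

    def drain(acc):
        # Process segments from the shared pool until none are left.
        while (seg := next_segment()) is not None:
            aggregate(seg, acc)

    def stealing_task(own_rows, acc):
        aggregate(own_rows, acc)  # process the task's own small partition first
        drain(acc)                # then steal remaining segments

    threads = []
    for part in partitions:
        acc = Counter()
        partials.append(acc)
        if len(part) > mean_size:
            # Segment task: cut the large partition into stealable segments.
            for i in range(0, len(part), segment_size):
                segments.append(part[i:i + segment_size])
            threads.append(threading.Thread(target=drain, args=(acc,)))
        else:
            threads.append(
                threading.Thread(target=stealing_task, args=(part, acc))
            )

    # All segments are queued before any task starts.
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the partial aggregates into the final result.
    total = Counter()
    for acc in partials:
        total.update(acc)
    return dict(total)
```

Calling `run_stage` on a list of partitions, each a list of `(key, value)` pairs, returns the per-key sums. Which task processes which segment varies from run to run, but the merged result does not, because summation is commutative and associative.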

Bibliographic record

  • Source
    International Journal of Parallel Programming | 2020, Issue 6 | pp. 941-956 | 16 pages
  • Author affiliations

    School of Data Science and Engineering East China Normal University Shanghai China;

    School of Data Science and Engineering East China Normal University Shanghai China;

    School of Data Science and Engineering East China Normal University Shanghai China;

    School of Data Science and Engineering East China Normal University Shanghai China;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    In-memory computing; Spark SQL; Aggregation; Data skew;


