Handling Data Skew for Aggregation in Spark SQL Using Task Stealing

Zeyu He; Qiuli Huang; Zhifang Li; Chuliang Weng


Abstract

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partitioning algorithm is difficult and requires users to have adequate prior knowledge of the data, so data skew is common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, which is one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid the additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative size of data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves performance improvements of 16% to 67% across different data sizes and distributions. Experiments show that the overhead involved is minimal and can be negligible.
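The task-stealing idea in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use Spark: the function name `aggregate_with_stealing`, the `segment_size` parameter, the "larger than the mean size" rule for choosing which partitions get segment tasks, and the use of threads in place of Spark tasks are all assumptions made for illustration. Partitions larger than the mean are cut into segments; each small partition gets a stealing task that first aggregates its own rows and then takes leftover segments from the large partitions.

```python
import threading
from collections import defaultdict, deque


def aggregate_with_stealing(partitions, segment_size=2):
    """Sum values per key across partitions (illustrative sketch only)."""
    if not partitions:
        return {}
    mean_size = sum(len(p) for p in partitions) / len(partitions)
    lock = threading.Lock()
    partials = []  # one partial aggregate per finished task

    def make_segments(rows):
        # Cut a large partition into fixed-size segments that can be stolen.
        return deque(rows[i:i + segment_size]
                     for i in range(0, len(rows), segment_size))

    # Assumed rule: partitions above the mean size get segment tasks,
    # the rest get stealing tasks.
    large = [make_segments(p) for p in partitions if len(p) > mean_size]
    small = [p for p in partitions if len(p) <= mean_size]

    def take(segments):
        # Hand out each segment exactly once.
        with lock:
            return segments.popleft() if segments else None

    def consume(acc, rows):
        for key, value in rows:
            acc[key] += value

    def drain(acc, segments):
        while (segment := take(segments)) is not None:
            consume(acc, segment)

    def segment_task(segments):
        acc = defaultdict(int)
        drain(acc, segments)
        with lock:
            partials.append(acc)

    def stealing_task(own_rows):
        acc = defaultdict(int)
        consume(acc, own_rows)       # process the task's own data first
        for segments in large:       # then steal remaining segments
            drain(acc, segments)
        with lock:
            partials.append(acc)

    threads = [threading.Thread(target=segment_task, args=(s,)) for s in large]
    threads += [threading.Thread(target=stealing_task, args=(p,)) for p in small]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the partial aggregates produced by all tasks.
    total = defaultdict(int)
    for acc in partials:
        for key, value in acc.items():
            total[key] += value
    return dict(total)
```

Because every segment is handed out exactly once and the partial aggregates are merged at the end, the result does not depend on which task ends up processing which segment; only the distribution of work changes.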


Bibliographic details

Source: International Journal of Parallel Programming, 2020, issue 6, pp. 941-956 (16 pages)
Authors: Zeyu He; Qiuli Huang; Zhifang Li; Chuliang Weng
Affiliation (all four authors): School of Data Science and Engineering, East China Normal University, Shanghai, China
Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language: English
Keywords: In-memory computing; Spark SQL; Aggregation; Data skew

Similar literature

1. Handling data skew at reduce stage in Spark by ReducePartition [J]. Concurrency, practice and experience. 2020, issue 9.
2. SP-Partitioner: A novel partition method to handle intermediate data skew in spark streaming [J]. Guipeng Liu, Xiaomin Zhu, Ji Wang. Future generation computer systems. 2018, issue SEPa.
3. Scheduling Spark Tasks With Data Skew and Deadline Constraints [J]. Haihua Gu, Xiaoping Li, Zhipeng Lu. Quality Control, Transactions. 2021, issue 1.
4. Idempotent Task Cache System for Handling Intermediate Data Skew in MapReduce on Cloud Computing [C]. Tzu-Chi Huang, Kuo-Chih Chu, Jia-Hui Lin. 2016 International Computer Symposium. 2016.
5. Performance analysis of scalable SQL and NoSQL databases: A quantitative approach [D]. Balasubramanian, Harish. 2014.
6. An adaptive spark-based framework for querying large-scale NoSQL and relational databases [O]. Eman Khashan, Ali Eldesouky, Sally Elghamrawy. 2021.
7. Migration from relational database like MySQL to nosql database like Cassandra is necessary and how to migrate it using spark [O]. Dr. Kishor Atkotiya, Parag Shukla. 2016.
