Software-Defined Data Shuffling for Big Data Jobs with Task Duplication

Abstract

Big data jobs are usually executed on large-scale distributed computing platforms that automatically divide a job into multiple computation phases, each containing a number of independent tasks that can run in parallel. The data shuffling process between two consecutive phases becomes the bottleneck of job execution. To improve its performance, a "push" shuffling approach has been proposed that sends intermediate results to the next phase as soon as they are generated. It avoids the local disk accesses of the traditional "pull" shuffling approach, and tasks in the next phase can start processing data without waiting for tasks in the preceding phase to finish. Task duplication is another approach that accelerates task execution by launching multiple copies of a task that compete to process the same data block. When "push" shuffling meets task duplication, big data jobs can be significantly accelerated, but the combination leads to a large amount of redundant data transmission between the two phases. To address this challenge, we propose a software-defined data shuffling approach in which a controller and a janitor module control the data shuffling process. Each task has a janitor that communicates with the controller to request permission to send intermediate results to next-stage tasks. We further propose an online grouping algorithm to reduce the overhead of frequent communication with the controller. The performance of the proposed algorithm is evaluated by extensive simulations.
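The controller and janitor interaction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual admission rule or its online grouping algorithm, so everything here is an assumption made for illustration: the class names `Controller` and `Janitor`, the first-come admission rule, the fixed `group_size` batching, and the `(block, sequence)` result identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller: admits one task copy per intermediate result."""
    admitted: dict = field(default_factory=dict)  # result_id -> copy_id
    requests_seen: int = 0  # messages received from janitors

    def request_permits(self, copy_id, result_ids):
        """Handle one (possibly grouped) request; return the admitted ids."""
        self.requests_seen += 1
        granted = []
        for result_id in result_ids:
            # Assumed rule: the first copy to ask for a result wins it.
            owner = self.admitted.setdefault(result_id, copy_id)
            if owner == copy_id:
                granted.append(result_id)
        return granted


@dataclass
class Janitor:
    """Hypothetical per-task janitor that groups its permit requests."""
    copy_id: str
    controller: Controller
    group_size: int = 1  # assumed fixed batch size, not the paper's algorithm
    pending: list = field(default_factory=list)
    sent: list = field(default_factory=list)  # results actually pushed

    def emit(self, result_id, record):
        """Buffer a new intermediate result; ask the controller once per group."""
        self.pending.append((result_id, record))
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self):
        """Request permits for all buffered results and push the admitted ones."""
        if not self.pending:
            return
        ids = [rid for rid, _ in self.pending]
        granted = set(self.controller.request_permits(self.copy_id, ids))
        self.sent.extend((rid, rec) for rid, rec in self.pending if rid in granted)
        self.pending.clear()


# Two duplicated copies of one task produce the same four results.
controller = Controller()
copy_a = Janitor("copy-A", controller, group_size=2)
copy_b = Janitor("copy-B", controller, group_size=2)
for seq in range(4):
    copy_a.emit(("block-7", seq), f"value-{seq}")
    copy_b.emit(("block-7", seq), f"value-{seq}")
copy_a.flush()
copy_b.flush()
print(len(copy_a.sent), len(copy_b.sent), controller.requests_seen)
```

In this toy run each result is pushed by only one of the two copies, and grouping two results per request halves the number of controller messages compared with `group_size=1`. A real system would also have to handle copy failures and choose the group size online, which this sketch does not attempt.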

Bibliographic record

Source
International Conference on Parallel Processing Workshops | 2016 | pp. 403-407 | 5 pages
Authors
Qimeng Zang; Hsiang-Yu Chan; Peng Li; Song Guo;
Original format: PDF
Keywords
Delays; Acceleration; Big data; Data communication; Distributed databases; Process control;
