Software-Defined Data Shuffling for Big Data Jobs with Task Duplication

Abstract

Big data jobs are usually executed on large-scale distributed computing platforms that automatically divide a job into multiple computation phases, each of which contains a number of independent tasks that can run in parallel. The data shuffling process between two consecutive phases becomes the bottleneck of job execution. To improve its performance, "push" shuffling has been proposed, which sends intermediate results to the next phase as soon as they are generated. It avoids the local disk accesses of the traditional "pull" shuffling approach, and tasks in the next phase can start processing data without waiting for tasks in the preceding phase to finish. Task duplication is another approach that accelerates task execution by launching multiple task copies that compete to process the same data block. When "push" shuffling meets task duplication, big data jobs can be significantly accelerated, but at the cost of a large amount of redundant data transmission between two phases. To address this challenge, we propose a software-defined data shuffling approach, designing a controller and a janitor module to control the data shuffling process. Each task has a janitor that communicates with the controller to request an admission permit for sending intermediate results to next-stage tasks. We further propose an online grouping algorithm to reduce the overhead of frequent communication with the controller. The performance of the proposed algorithm is evaluated by extensive simulations.
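
The controller and janitor interaction described in the abstract can be pictured with a minimal sketch. The abstract does not give the paper's actual protocol, so everything here is an assumption made for illustration: the class names `Controller` and `Janitor`, the `request_permit` and `push` methods, and the "first copy to ask wins" policy are all hypothetical, and the network send is replaced by an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller: one send permit per (data block, destination task)."""
    # Maps (block_id, dest_task) to the task copy that holds the permit.
    granted: dict = field(default_factory=dict)

    def request_permit(self, copy_id: str, block_id: int, dest_task: str) -> bool:
        key = (block_id, dest_task)
        # Assumed policy: the first copy to ask wins; later duplicates are refused.
        holder = self.granted.setdefault(key, copy_id)
        return holder == copy_id


@dataclass
class Janitor:
    """Hypothetical per-task janitor that asks the controller before pushing."""
    copy_id: str
    controller: Controller
    sent: list = field(default_factory=list)

    def push(self, block_id: int, dest_task: str, payload: bytes) -> bool:
        if not self.controller.request_permit(self.copy_id, block_id, dest_task):
            return False  # another copy already holds the permit; skip the transfer
        self.sent.append((block_id, dest_task, payload))  # stand-in for a network send
        return True


controller = Controller()
original = Janitor("map-0/copy-a", controller)
duplicate = Janitor("map-0/copy-b", controller)

# Both copies produce the same intermediate result for the same data block.
first = original.push(block_id=0, dest_task="reduce-1", payload=b"k1:v1")
second = duplicate.push(block_id=0, dest_task="reduce-1", payload=b"k1:v1")
```

In this sketch only one of the two task copies transmits the result for a given block, which is the redundancy the abstract sets out to avoid. Each push costs one round trip to the controller, which is the overhead the abstract says the online grouping algorithm is meant to reduce.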

Bibliographic record

Source
International Workshop on Embedded Multicore Systems | 2016 | 440p | 5 pages in total
Conference location
Authors
Qimeng Zang; Hsiang-Yu Chan; Peng Li; Song Guo
Author affiliations
Conference organizer
Original format PDF
Text language
Chinese Library Classification TP311.133.2-53
Keywords
Delays; Acceleration; Big data; Data communication; Distributed databases; Process control

Similar literature

1. Online Shuffling with Task Duplication in Cloud [J]. ZANG Qimeng, GUO Song. ZTE Communications (中兴通讯技术（英文版）). 2017, No. 4
2. BigDataSDNSim: A simulator for analyzing big data applications in software-defined cloud data centers [J]. Alwasel Khaled, Calheiros Rodrigo N., Garg Saurabh. Software, practice & experience. 2021, No. 5
3. When big data meets software-defined networking: SDN for big data and big data for SDN [J]. Cui Laizhong, Yu F. Richard, Yan Qiao. Network, IEEE. 2016, No. 1
4. Software-Defined Data Shuffling for Big Data Jobs with Task Duplication [C]. Qimeng Zang, Hsiang-Yu Chan, Peng Li. International Conference on Parallel Processing Workshops. 2016
5. Improving Data-Shuffle Performance in Data-Parallel Distributed Systems [D]. Samson, Shweelan. 2018
6. Scheduling multi-task jobs with extra utility in data centers [O]. Xiaolin Fang, Junzhou Luo, Hong Gao
7. Ancient gene duplication and domain shuffling in the animal cyclic nucleotide phosphodiesterase family [O]. Koyanagi Mitsumasa, Suga Hiroshi, Hoshiyama Daisuke. 1998. (Title footnote: The nucleotide sequence data reported in this paper will appear in the DDBJ, EMBL and GenBank nucleotide sequence databases with accession numbers AB017021–AB017024.)
8. Matrix and position correction of shuffler assays by application of the alternating conditional expectation algorithm to shuffler data [R]. Pickrell, M M; Rinard, P M. 1992
