Conference: International Workshop on Embedded Multicore Systems

Software-Defined Data Shuffling for Big Data Jobs with Task Duplication

Abstract

Big data jobs are usually executed on large-scale distributed computing platforms that automatically divide a job into multiple computation phases, each of which contains a number of independent tasks that can run in parallel. The data shuffling process between two consecutive phases becomes the bottleneck of job execution. To improve its performance, "push" shuffling has been proposed, which sends intermediate results to the next phase as soon as they are generated. It avoids the local disk accesses of the traditional "pull" shuffling approach, and tasks in the next phase can start processing data without waiting for tasks in the preceding phase to finish. Task duplication is another approach that accelerates task execution by launching multiple copies of a task that compete to process the same data block. When "push" shuffling meets task duplication, big data jobs can be significantly accelerated, but at the cost of a large amount of redundant data transmission between the two phases. To address this challenge, we propose a software-defined data shuffling approach in which a controller and a janitor module control the data shuffling process. Each task has a janitor that communicates with the controller to request permission to send intermediate results to next-stage tasks. We further propose an online grouping algorithm to reduce the overhead of frequent communication with the controller. The performance of the proposed algorithm is evaluated through extensive simulations.
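The controller and janitor interaction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the admission policy, so this sketch assumes a simple "first copy to ask wins" rule per data block and destination. The names `Controller`, `Janitor`, `request_permit`, and `push` are hypothetical and not taken from the paper, and a list append stands in for the network send.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller: admits one sender per (block, destination)."""
    # Maps (block_id, dest) to the id of the task copy holding the permit.
    granted: dict = field(default_factory=dict)

    def request_permit(self, copy_id: str, block_id: int, dest: str) -> bool:
        key = (block_id, dest)
        # Assumed policy: the first copy to ask wins. setdefault leaves an
        # existing winner in place and returns it.
        return self.granted.setdefault(key, copy_id) == copy_id


@dataclass
class Janitor:
    """Hypothetical per-task janitor that gates pushes through the controller."""
    copy_id: str
    block_id: int
    controller: Controller
    sent: list = field(default_factory=list)

    def push(self, dest: str, record: str) -> bool:
        if self.controller.request_permit(self.copy_id, self.block_id, dest):
            self.sent.append((dest, record))  # stand-in for a network send
            return True
        return False  # another copy of this task already holds the permit


controller = Controller()
original = Janitor("copy-0", block_id=7, controller=controller)
duplicate = Janitor("copy-1", block_id=7, controller=controller)

print(original.push("reducer-A", "k1=3"))   # True: first request is admitted
print(duplicate.push("reducer-A", "k1=3"))  # False: redundant push is suppressed
print(duplicate.push("reducer-B", "k2=5"))  # True: no copy has claimed reducer-B
```

In this sketch every push costs one round trip to the controller, which is the kind of frequent communication overhead that the online grouping algorithm mentioned in the abstract is meant to reduce.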
