Venue: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

EC-Shuffle: Dynamic Erasure Coding Optimization for Efficient and Reliable Shuffle in Spark



Abstract

Fault-tolerance capabilities attract increasing attention in existing data processing frameworks such as Apache Spark. To avoid replaying costly distributed computations such as shuffle, two approaches are popular: local checkpointing and remote replication. Both incur significant runtime overhead, such as extra storage cost or network traffic. Erasure coding is another emerging technology that also enables data resilience; thanks to its high storage efficiency, it is perceived as capable of replacing checkpoint and replication mechanisms. However, it incurs heavy network traffic because data partitions must be distributed to different locations. In this paper, we propose EC-Shuffle with two encoding schemes and optimize the shuffle-based operations in Spark and MapReduce-like frameworks. Specifically, our encoding schemes concentrate on reducing the data traffic incurred during the execution of shuffle operations: they transfer only the parity chunks generated via erasure coding, instead of a whole copy of all data chunks. EC-Shuffle also provides a strategy that dynamically selects the per-shuffle biased encoding scheme according to the number of senders and receivers in each shuffle. Our analyses indicate that this dynamic encoding selection minimizes the total size of parity chunks. Extensive experimental results using BigDataBench with hundreds of mappers and reducers show that this optimization can reduce network traffic by up to 50% and improve performance by up to 38%.
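The parity-only transfer idea in the abstract can be illustrated with the simplest possible erasure code: a single XOR parity chunk over k data chunks, which lets any one lost chunk be rebuilt from the survivors. This sketch is purely illustrative and is not the paper's encoding scheme (EC-Shuffle uses more general erasure codes); the function names are hypothetical:

```python
from functools import reduce

def xor_parity(chunks):
    """Compute one parity chunk as the bytewise XOR of all data chunks.

    This is the simplest (k, k+1) erasure code: any single missing data
    chunk can be rebuilt from the surviving chunks plus the parity,
    so only the small parity chunk needs to travel for resilience."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def recover(surviving_chunks, parity):
    """Rebuild the single missing data chunk: XOR of survivors and parity."""
    return xor_parity(surviving_chunks + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)          # 4 bytes of parity, not a full replica
# Suppose chunk 1 is lost; rebuild it from the other chunks plus the parity.
rebuilt = recover([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Compared with full replication, which would ship a complete copy of every data chunk, this scheme ships one parity chunk per stripe, which is the kind of traffic saving the abstract's parity-only transfer targets.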
