Venue: IEEE International Conference on Big Data Computing Service and Applications

Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework

Abstract

Fault tolerance is an important property of distributed and parallel computing systems. An emerging trend in Big Data computing is to combine MPI and MapReduce technologies in a single framework. The distinctive state model of such frameworks makes it challenging to design an efficient and transparent fault-tolerance mechanism. In this paper, a state model analysis method is proposed for uniformly modeling independent MPI, MapReduce, and MPI-based MapReduce data computing frameworks. Based on this analysis, a library-level fault-tolerance mechanism with a global persistent state model is proposed, and a checkpoint approach based on data staging and routine sharing is designed within this mechanism. The proposed mechanism has been implemented in DataMPI, a communication library supporting MPI-based MapReduce data computing applications. Experiments show that it transparently enables fault tolerance for applications. Taking TeraSort as an example, it introduces only 6.8% time overhead and 11% space overhead. For a failure-resume execution, it has a 10%-32% performance advantage over naive checkpoint solutions based on local or parallel storage. The proposed mechanism also provides superior performance and resource utilization compared with Hadoop for both fault-free and failure-resume executions.
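The abstract does not describe how the checkpoint approach works internally, so the following is only a generic sketch of the failure-resume idea it refers to: persist progress periodically, then restart from the last persisted state instead of from the beginning. It is not DataMPI's mechanism or API. The function names, the JSON checkpoint file, the `interval` parameter, and the `fail_at` crash hook are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically persist state: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last persisted state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_index": 0, "counts": {}}


def count_words(records, checkpoint_path, interval=2, fail_at=None):
    """Count words across records, checkpointing every `interval` records.

    `fail_at` is a hypothetical hook that simulates a crash just before
    the record at that index is processed.
    """
    state = load_checkpoint(checkpoint_path)
    for index in range(state["next_index"], len(records)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated failure")
        for word in records[index].split():
            state["counts"][word] = state["counts"].get(word, 0) + 1
        state["next_index"] = index + 1
        if state["next_index"] % interval == 0:
            save_checkpoint(checkpoint_path, state)
    return state["counts"]
```

Calling `count_words` again with the same `checkpoint_path` after a failure resumes from the last saved index, so work done since that checkpoint is repeated but earlier work is not. The write-then-rename step keeps a crash during checkpointing from leaving a half-written file.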
