Conference: IEEE International Conference on Cluster Computing and Workshops

Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation

Abstract

Traditionally, cluster computing has employed checkpointing to address fault tolerance. Recently, new models for parallel applications have grown in popularity, namely MapReduce and Dryad, whose runtime systems provide their own re-execution-based fault-tolerance mechanisms, but without any analysis of their failure characteristics. Another development is the availability of multi-year failure data for systems of significant size at Los Alamos National Laboratory (LANL). The time between failures (TBF) for these systems is a poor fit to the exponential distribution assumed by optimization work in checkpointing, which calls those results into question. This paper describes a discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks. The simulation lets us evaluate the fault-tolerance characteristics of these tasks, with the goal of minimizing the expected running time of a parallel program on a cluster in the presence of faults, for both fault-tolerance models.
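
To make the checkpointing side of such a simulation concrete, here is a minimal sketch of a failure-driven run of one checkpointed job. It is not the paper's simulator: the function names, the cost parameters, the choice to ignore failures during restart, and the Weibull sampler standing in for non-exponential TBF data are all assumptions made for illustration.

```python
import random


def simulate_checkpointed_run(work, interval, ckpt_cost, restart_cost, next_tbf):
    """Return the wall-clock time needed to finish `work` units of computation.

    Hypothetical model: the job checkpoints after every `interval` units of
    work (except after the final segment). A failure discards the segment in
    progress and costs `restart_cost`; failures during restart are ignored.
    `next_tbf` is any zero-argument callable returning the time until the
    next failure, so recorded failure traces could be plugged in here.
    """
    clock = 0.0
    saved = 0.0                      # work protected by the last checkpoint
    time_to_failure = next_tbf()
    while saved < work:
        segment = min(interval, work - saved)
        is_last = saved + segment >= work
        cost = segment + (0.0 if is_last else ckpt_cost)
        if time_to_failure >= cost:
            # Segment (and its checkpoint) completes before the next failure.
            clock += cost
            time_to_failure -= cost
            saved += segment
        else:
            # Failure mid-segment: the partial work is lost, then we restart.
            clock += time_to_failure + restart_cost
            time_to_failure = next_tbf()
    return clock


def mean_runtime(interval, trials=200, seed=0):
    """Average runtime over `trials` runs with Weibull-distributed TBF.

    The Weibull scale/shape values are illustrative assumptions, not LANL fits.
    """
    rng = random.Random(seed)
    sampler = lambda: rng.weibullvariate(500.0, 0.7)  # (scale, shape)
    runs = [
        simulate_checkpointed_run(1000.0, interval, 5.0, 20.0, sampler)
        for _ in range(trials)
    ]
    return sum(runs) / len(runs)
```

Sweeping `interval` through `mean_runtime` and picking the value with the smallest average is one simple way to search for a checkpoint interval empirically when the exponential assumption does not hold.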