Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation

Abstract

Traditionally, cluster computing has relied on checkpointing for fault tolerance. Recently, new models for parallel applications, notably MapReduce and Dryad, have grown in popularity. Their runtime systems provide their own re-execution-based fault-tolerance mechanisms, but the failure characteristics of these mechanisms have not been analyzed. Another development is the availability of failure data spanning years for systems of significant size at Los Alamos National Laboratory (LANL). The time between failures (TBF) for these systems is a poor fit to the exponential distribution assumed by optimization work in checkpointing, which calls those results into question. This paper describes a discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks. The simulation allows us to evaluate the fault-tolerance characteristics of these tasks, with the goal of minimizing the expected running time of a parallel program on a cluster in the presence of faults under both fault-tolerance models.
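To make the idea of a failure-driven simulation of checkpointing concrete, the following is a minimal sketch and not the authors' simulator. It assumes a single job that checkpoints at a fixed interval, that failures do not occur during restart, and that inter-failure times are drawn from a synthetic distribution rather than from the LANL traces. All function names, the Weibull shape of 0.7, and the job parameters are hypothetical values chosen only for illustration.

```python
import math
import random


def simulate_checkpointed_job(work, interval, ckpt_cost, restart_cost, next_tbf, rng):
    """Return the wall-clock time needed to finish `work` units of computation.

    A checkpoint (costing `ckpt_cost`) is written after every `interval` units
    of work except the last segment. A failure discards the segment in progress
    and adds `restart_cost`. `next_tbf(rng)` samples the time to the next failure.
    """
    clock = 0.0
    done = 0.0  # work that is safely covered by a checkpoint
    next_failure = next_tbf(rng)
    while done < work:
        segment = min(interval, work - done)
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else ckpt_cost)
        if clock + cost <= next_failure:
            # Segment (and its checkpoint) completes before the next failure.
            clock += cost
            done += segment
        else:
            # Failure mid-segment: lose the partial segment, then restart.
            # Assumption: no failure can occur during the restart itself.
            clock = next_failure + restart_cost
            next_failure = clock + next_tbf(rng)
    return clock


def exponential_tbf(mtbf):
    """Sampler for exponentially distributed inter-failure times."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull_tbf(mtbf, shape):
    """Sampler for Weibull inter-failure times with the same mean as `mtbf`."""
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)
    return lambda rng: rng.weibullvariate(scale, shape)


def mean_runtime(next_tbf, runs=2000, seed=0, work=1000.0, interval=50.0,
                 ckpt_cost=5.0, restart_cost=10.0):
    """Average runtime over `runs` independent simulated executions."""
    rng = random.Random(seed)
    total = sum(
        simulate_checkpointed_job(work, interval, ckpt_cost, restart_cost, next_tbf, rng)
        for _ in range(runs)
    )
    return total / runs


if __name__ == "__main__":
    # Illustrative parameters only; not fitted to any real failure data.
    print("exponential:", mean_runtime(exponential_tbf(500.0)))
    print("weibull    :", mean_runtime(weibull_tbf(500.0, shape=0.7)))
```

Sweeping `interval` for each sampler shows how the runtime-minimizing checkpoint interval can shift when the inter-failure distribution is not exponential, which is the kind of question the abstract raises.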


Bibliographic record

Source: IEEE International Conference on Cluster Computing and Workshops, 2009, 10 pages
Original format: PDF
CLC classification: Automation technology, computer technology

Similar literature

1. Analysis and evaluation of MapReduce solutions on an HPC cluster [J]. Veiga Jorge, Exposito Roberto R., Taboada Guillermo L. Computers and Electrical Engineering, 2016.
2. Modelling, simulation, and experimental evaluation of a crossflow heat exchanger for an aircraft environmental control system [J]. S Shah, G Liu, D R Greatrix. Proceedings of the Institution of Mechanical Engineers, 2010, issue g5.
3. Multiprogrammed non-blocking checkpoints in support of optimistic simulation on Myrinet clusters [J]. Andrea Santoro, Francesco Quaglia. Journal of Systems Architecture, 2007, issue 9.
4. Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation [C]. IEEE International Conference on Cluster Computing and Workshops, 2009.
5. Evaluating MapReduce System Performance: A Simulation Approach [D]. Wang, Guanying. 2012.
6. Experimental evaluation of single-domain antibodies predicted by molecular dynamics simulations to have elevated thermal stability [O]. Dan Zabetakis, Lisa C. Shriver-Lake, Mark A. Olson. 2019.
7. A Simulation Approach to Evaluating Design Decisions in MapReduce Setups [O]. Guanying Wang, Ali R. Butt, Prashant P. 2009.
8. Experimental Evaluation of Sobriety Checkpoint Programs [R]. Stuster, J. W., Blowers, P. A. 1995.
