Journal: IEEE Transactions on Parallel and Distributed Systems

Athanasia: A User-Transparent and Fault-Tolerant System for Parallel Applications

Abstract

This article presents Athanasia, a user-transparent and fault-tolerant system for parallel applications running on large-scale cluster systems. Cluster systems have been regarded as a de facto standard for achieving multi-teraflop computing power. These cluster systems have an inherent failure factor that can cause computation failure. The reliability of parallel computing systems has therefore been studied in the literature for a relatively long time, and this extensive research has produced many theoretical promises. Despite these rigorous studies, however, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. Athanasia is a user-transparent checkpointing system for a fault-tolerant Message Passing Interface (MPI) implementation that is primarily based on the sync-and-stop protocol. Athanasia supports three functionalities that are critical for fault tolerance: a lightweight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. Its main features are that it requires no modifications to the application code and that it preserves many of the high-performance characteristics of high-speed networks. Experimental results show that Athanasia can be a good candidate for a practically deployable fault-tolerant system in very large, high-performance clusters and that its protocol can be applied easily to a variety of parallel communication libraries.
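
The sync-and-stop idea named in the abstract can be illustrated with a toy, single-threaded simulation. This is only a sketch under stated assumptions, not Athanasia's implementation: it uses no MPI, and the `Process` class, the integer "balance" state, the `network` queue, and the function names are all hypothetical. It shows the general pattern of a coordinated checkpoint: stop all senders, drain in-flight messages, save every process's state, then resume.

```python
from collections import deque


class Process:
    """Toy process whose entire state is one integer balance (hypothetical)."""

    def __init__(self, rank, balance):
        self.rank = rank
        self.balance = balance
        self.stopped = False

    def send(self, network, dest, amount):
        # A stopped process must not inject new application messages.
        if self.stopped:
            return False
        self.balance -= amount
        network.append((dest, amount))  # message is now in flight
        return True

    def deliver(self, amount):
        self.balance += amount


def sync_and_stop_checkpoint(processes, network):
    """Coordinated checkpoint: stop, drain the network, save, resume."""
    # Phase 1 (stop): every process stops sending.
    for p in processes:
        p.stopped = True
    # Phase 2 (sync): deliver all in-flight messages so none cross the checkpoint.
    while network:
        dest, amount = network.popleft()
        processes[dest].deliver(amount)
    # Phase 3 (checkpoint): record each process's state.
    snapshot = {p.rank: p.balance for p in processes}
    # Phase 4 (resume): let the computation continue.
    for p in processes:
        p.stopped = False
    return snapshot


def restore(processes, snapshot):
    """Roll every process back to the saved state (recovery after a failure)."""
    for p in processes:
        p.balance = snapshot[p.rank]
        p.stopped = False
```

Because the network is drained before any state is saved, the snapshot contains no message that was sent but not received, so the total balance recorded in the checkpoint equals the total in the system. In a real MPI setting the drain step would have to be implemented with actual communication, which this sketch does not attempt.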