Venue: International Conference on Computational Science

RADIC Based Fault Tolerance System with Dynamic Resource Controller



Abstract

Continuously growing High-Performance Computing requirements increase the number of components and, with it, the probability of failure. Long-running parallel applications are directly affected by this phenomenon, since a failure disrupts their execution. MPI, a well-known standard for parallel applications, follows fail-stop semantics, requiring application owners to restart the whole execution when hard failures appear, losing time and computation data. Fault Tolerance (FT) techniques address this issue by providing high availability to users' application executions, though at significant resource and time cost. In this paper, we present a Fault Tolerance Manager (FTM) framework based on the RADIC architecture, which provides FT protection to parallel applications implemented with MPI so that they complete successfully despite failures. The solution is implemented in the application layer, following the uncoordinated and semi-coordinated rollback-recovery protocols. It uses a sender-based message logger to store the messages exchanged between application processes, and it checkpoints only the process data required to restart them in case of failure. The solution uses the concepts of ULFM (User Level Failure Mitigation) for failure detection and recovery. Furthermore, a dynamic resource controller is added to the proposal; it monitors the message logger buffers and takes action to maintain an acceptable level of protection. Experimental validation verifies the FTM functionality on two private cluster infrastructures.
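
The interplay between a sender-based message log and a controller that watches its buffers can be illustrated with a small sketch. This is not the authors' FTM code, and it makes no MPI or ULFM calls. The class `SenderLog`, the function `controller_step`, the in-memory byte budget, and the 80% threshold are all hypothetical names and values chosen for illustration. The abstract does not say which actions the controller performs; the sketch assumes one plausible action, namely asking the destination that occupies the most log space to checkpoint so that its logged messages can be released.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical in-memory sender-side log of outgoing messages per destination."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.entries = defaultdict(list)   # dest -> [(seq, payload), ...]
        self.next_seq = defaultdict(int)   # dest -> next sequence number

    def record(self, dest, payload):
        """Keep a copy of an outgoing message so it can be replayed later."""
        seq = self.next_seq[dest]
        self.next_seq[dest] += 1
        self.entries[dest].append((seq, bytes(payload)))
        return seq

    def bytes_used(self, dest=None):
        """Payload bytes held for one destination, or for all of them."""
        dests = [dest] if dest is not None else list(self.entries)
        return sum(len(p) for d in dests for _, p in self.entries[d])

    def release_up_to(self, dest, last_seq):
        """Drop messages covered by the destination's checkpoint (seq <= last_seq)."""
        self.entries[dest] = [(s, p) for s, p in self.entries[dest] if s > last_seq]

    def replay(self, dest):
        """Payloads to resend, in order, if the destination restarts."""
        return [p for _, p in self.entries[dest]]


def controller_step(log, threshold=0.8):
    """Assumed policy: when usage reaches the threshold, pick the destination
    whose checkpoint would free the most log space; otherwise return None."""
    if not log.entries or log.bytes_used() < threshold * log.capacity_bytes:
        return None
    return max(log.entries, key=log.bytes_used)
```

In use, a sender would call `record` before each send, run `controller_step` periodically, and call `release_up_to` once the chosen destination reports a completed checkpoint. A real controller might instead spill the log to disk or grow the buffer; the abstract leaves that choice open.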
