Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

Abstract

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
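
The detection-and-recovery idea summarized in the abstract can be illustrated with a toy sketch. This is not SEDAR's implementation: the names `run_protected`, `add_step`, and `SilentErrorDetected`, the integer state, and the bit-flip fault injection are all hypothetical, and real replicas would be separate processes exchanging messages. With `recover=False` the sketch loosely mirrors level (1), detection and safe-stop; with `recover=True` it rolls back to a single checkpoint that is stored only after both replicas agree, loosely in the spirit of level (3). Level (2), with multiple system-level checkpoints, is not modeled.

```python
import copy


class SilentErrorDetected(Exception):
    """Raised when the two replicas disagree and recovery is disabled."""


def run_protected(step_fn, state, n_steps, ckpt_every=2, recover=True, faults=()):
    """Run n_steps of step_fn on two replicas and compare their results.

    faults: step indices at which replica B suffers a one-shot bit flip
    (a hypothetical stand-in for a transient fault; assumes integer state).
    Returns (final_state, number_of_rollbacks).
    """
    pending = set(faults)                   # each transient fault fires once
    checkpoint = (0, copy.deepcopy(state))  # last state both replicas agreed on
    rollbacks = 0
    step = 0
    while step < n_steps:
        a = step_fn(copy.deepcopy(state), step)  # replica A
        b = step_fn(copy.deepcopy(state), step)  # replica B
        if step in pending:
            pending.discard(step)
            b ^= 1 << 3                          # inject a silent bit flip
        if a != b:                               # detection by comparison
            if not recover:
                raise SilentErrorDetected(f"replicas diverged at step {step}")
            step, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            rollbacks += 1
            continue                             # re-execute from the checkpoint
        state = a
        step += 1
        if step % ckpt_every == 0:               # store only validated state
            checkpoint = (step, copy.deepcopy(state))
    return state, rollbacks


def add_step(state, i):
    """Toy computation: accumulate the step index."""
    return state + i


if __name__ == "__main__":
    print(run_protected(add_step, 0, 6))              # (15, 0)
    print(run_protected(add_step, 0, 6, faults={3}))  # (15, 1)
```

In the second call the fault at step 3 is detected, execution rolls back to the checkpoint taken at step 2, and the re-execution completes with the same result as the fault-free run, at the cost of one repeated step.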

Bibliographic record

  • Source
    Future Generation Computer Systems | 2020, Issue 12 | pp. 240-254 | 15 pages
  • Author affiliations

    III-LIDI - Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina;

    III-LIDI - Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina;

    III-LIDI - Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina;

    III-LIDI - Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina;

    CAOS - Computer Architecture and Operating Systems, Universidad Autónoma de Barcelona, Barcelona, Spain;

    CAOS - Computer Architecture and Operating Systems, Universidad Autónoma de Barcelona, Barcelona, Spain;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA);
  • Original format: PDF
  • Language: English
  • Keywords

    Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint;