Conference: International Conference on High Performance Computing & Simulation
Opportunistic application-level fault detection through adaptive redundant multithreading

Abstract

As the scale and complexity of future High Performance Computing (HPC) systems continue to grow, the rising frequency of faults and errors and their impact on HPC applications will make it increasingly difficult to accomplish useful computation. Traditional means of fault detection and correction are either hardware-based or use software-based redundancy. Redundancy-based approaches usually entail complete replication of the program state or the computation and therefore incur substantial overhead on application performance. Consequently, the wide-scale use of full redundancy in future exascale-class systems is not a viable solution for error detection and correction. In this paper we present an application-level fault detection approach based on adaptive redundant multithreading. Through a language-level directive, the programmer can define structured code blocks. When these blocks are executed by multiple threads and their outputs compared, we can detect errors in the specific parts of the program state that ultimately determine the correctness of the application outcome. The compiler outlines such code blocks, and a runtime system decides whether their execution by redundant threads should be enabled or disabled by continuously observing and learning about the fault tolerance state of the system. By providing flexible building blocks for application-specific fault detection, our approach makes possible more reasonable performance overheads than full redundancy. Our results show that the overheads to application performance are in the range of 4% to 70%, because the runtime system is continuously aware of the rate and source of system faults, rather than the usual overhead in excess of 100% that is incurred by complete replication.
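The abstract does not show the directive syntax or the runtime's decision policy, so the following is only a minimal sketch of the general idea: run a marked block in two threads, compare the outputs, and let a runtime switch the redundancy on or off based on a recently observed fault rate. The class name `AdaptiveRedundancyRuntime`, its methods, and the `window` and `threshold` parameters are all hypothetical and are not taken from the paper.

```python
import threading
from collections import deque


class AdaptiveRedundancyRuntime:
    """Hypothetical runtime that decides per call whether to run a block redundantly."""

    def __init__(self, window=100, threshold=0.05):
        # Sliding window of recent observations (1 = fault seen, 0 = clean).
        self.history = deque(maxlen=window)
        self.threshold = threshold  # assumed policy knob, not from the paper
        self.detected = 0           # mismatches caught by redundant execution

    def observe(self, fault_seen):
        """Record one observation about the system's health (e.g. from a monitor)."""
        self.history.append(1 if fault_seen else 0)

    def redundancy_enabled(self):
        """Enable redundancy when the recent fault rate reaches the threshold."""
        if not self.history:
            return True  # be conservative until something has been observed
        return sum(self.history) / len(self.history) >= self.threshold

    def run_block(self, block, *args):
        """Run `block` once, or twice in separate threads with output comparison.

        Assumes `block` does not raise and returns a value comparable with `!=`.
        """
        if not self.redundancy_enabled():
            return block(*args)

        results = [None, None]

        def worker(slot):
            results[slot] = block(*args)

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        mismatch = results[0] != results[1]
        self.observe(mismatch)
        if mismatch:
            self.detected += 1
            raise RuntimeError("redundant executions disagree: possible fault")
        return results[0]
```

In this sketch a caller would wrap each protected region as `rt.run_block(compute, data)`, while an external monitor feeds `rt.observe(...)` so that redundancy can be re-enabled after it has been switched off. CPython threads do not execute bytecode in parallel, so this illustrates the control flow only and says nothing about the overheads reported in the paper.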
