Opportunistic application-level fault detection through adaptive redundant multithreading

Abstract

As the scale and complexity of future High Performance Computing systems continue to grow, the rising frequency of faults and errors and their impact on HPC applications will make it increasingly difficult to accomplish useful computation. Traditional means of fault detection and correction are either hardware based or use software-based redundancy. Redundancy-based approaches usually entail complete replication of the program state or the computation and therefore incur substantial overhead to application performance. Therefore, the wide-scale use of full redundancy in future exascale-class systems is not a viable solution for error detection and correction. In this paper we present an application-level fault detection approach that is based on adaptive redundant multithreading. Through a language-level directive, the programmer can define structured code blocks. When these blocks are executed by multiple threads and their outputs compared, we can detect errors in specific parts of the program state that will ultimately determine the correctness of the application outcome. The compiler outlines such code blocks, and a runtime system decides whether their execution by redundant threads should be enabled or disabled by continuously observing and learning about the fault tolerance state of the system. By providing flexible building blocks for application-specific fault detection, our approach makes possible more reasonable performance overheads than full redundancy. Our results show that the overheads to application performance are in the range of 4% to 70%, because the runtime system is continuously aware of the rate and source of system faults, rather than the usual overhead in excess of 100% that is incurred by complete replication.
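The idea in the abstract (run a marked block on redundant threads, compare the outputs, and let a runtime switch redundancy on or off) can be illustrated with a minimal sketch. This is not the authors' directive, compiler, or runtime system: the class `AdaptiveRedundancy`, its `threshold` parameter, the `report_fault_rate` method, and the `dot` example block are all hypothetical names chosen for illustration, and the fault-rate signal is simply supplied by the caller.

```python
import threading


class AdaptiveRedundancy:
    """Hypothetical runtime: duplicate a block only while faults seem likely."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold   # assumed cutoff on the observed fault rate
        self.enabled = True          # start conservatively with redundancy on
        self.redundant_runs = 0      # how many times a block was duplicated

    def report_fault_rate(self, rate):
        # Stand-in for the runtime observing the system's fault state.
        self.enabled = rate >= self.threshold

    def run(self, block, *args):
        if not self.enabled:
            # Redundancy disabled: execute once, with no detection overhead.
            return block(*args)

        results = [None, None]

        def worker(slot):
            results[slot] = block(*args)

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        self.redundant_runs += 1
        if results[0] != results[1]:
            # The two executions disagree, so at least one was corrupted.
            raise RuntimeError("redundant executions disagree: possible fault")
        return results[0]


def dot(xs, ys):
    """Example block whose output matters for application correctness."""
    return sum(x * y for x, y in zip(xs, ys))


runtime = AdaptiveRedundancy(threshold=0.01)
checked = runtime.run(dot, [1, 2, 3], [4, 5, 6])    # executed on two threads
runtime.report_fault_rate(0.0)                      # system looks healthy
unchecked = runtime.run(dot, [1, 2, 3], [4, 5, 6])  # executed once
```

In this sketch the comparison only detects a mismatch; it does not say which copy is wrong, which matches the abstract's focus on detection rather than correction.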


Bibliographic details

Source
International Conference on High Performance Computing Simulation | 2014 | 8 pages
Authors
Hukerikar Saurabh; Diniz Pedro C.; Lucas Robert F.; Teranishi Keita
Original format: PDF
Chinese Library Classification: General issues
Keywords
Fault detection; Hardware; Instruction sets; Multithreading; Redundancy; Runtime


Similar literature


1. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading [J]. Saurabh Hukerikar, Keita Teranishi, Pedro C. Diniz. International Journal of Parallel Programming. 2018, No. 2
2. Compound faults detection of rotating machinery using improved adaptive redundant lifting multiwavelet [J]. Jinglong Chen, Yanyang Zi, Zhengjia He. Mechanical Systems and Signal Processing. 2013, No. 1
3. Adaptive redundant multiwavelet denoising with improved neighboring coefficients for gearbox fault detection [J]. Jinglong Chen, Yanyang Zi, Zhengjia He. Mechanical Systems and Signal Processing. 2013, No. 2
4. Opportunistic application-level fault detection through adaptive redundant multithreading [C]. Hukerikar Saurabh, Diniz Pedro C., Lucas Robert F. International Conference on High Performance Computing Simulation. 2014
5. Fault detection and diagnostics of an HVAC sub-system using adaptive resonance theory neural networks [D]. Jones, Christian Birk. 2015
6. Adaptive Redundant Lifting Wavelet Transform Based on Fitting for Fault Feature Extraction of Roller Bearings [O]. Zijing Yang, Ligang Cai, Lixin Gao. 2012
7. RedThreads: An Interface for Application-level Fault Detection/Correction through Adaptive Redundant Multithreading [O]. Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C. 2017
