Understanding the propagation of transient errors in HPC applications


Abstract

Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.

Bibliographic record

Source
International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1-12 (12 pages)
Authors
Rizwan A. Ashraf; Roberto Gioiosa; Gokcen Kestor; Ronald F. DeMara; Chen-Yong Cher; Pradip Bose;
Original format: PDF
Keywords
Circuit faults; Transient analysis; Hardware; Resilience; Computational modeling; Measurement; Registers;

Similar literature

1. Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design [J]. Man-Lap Li, Pradeep Ramachandran, Swarup K. Sahoo, Computer architecture news. 2008, no. 1
2. Understanding the propagation of hard errors to software and implications for resilient system design [J]. Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, ACM SIGPLAN Notices: A Monthly Publication of the Special Interest Group on Programming Languages. 2008, no. 3
3. Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications [J]. Caio Lunardi, Heather Quinn, Laura Monroe, IEEE Transactions on Nuclear Science. 2017, no. 8
4. Understanding the propagation of transient errors in HPC applications [C]. Rizwan A. Ashraf, Roberto Gioiosa, Gokcen Kestor, International Conference for High Performance Computing, Networking, Storage and Analysis. 2015
5. Interface selective transient grating spectroscopy: Theory and applications to thermal flow and acoustic propagation in thin films [D]. Marshall, Christopher David. 1992
6. Application of the back-error propagation artificial neural network (BPANN) on genetic variants in the PPAR-γ and RXR-α gene and risk of metabolic syndrome in a Chinese Han population [O]. Xu Zhao, Kang Xu, Hui Shi, 2014
7. Understanding the Spatial Characteristics of DRAM Errors in HPC Clusters [O]. Ayush Patwari, Ignacio Laguna, Martin Schulz, 2017
