Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications

Abstract

Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general-purpose GPU (GPGPU) applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our proposed methodology is that GPGPU applications spawn many threads, yet many of those threads execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can first be classified based on the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore the error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. Together, these steps reduce the number of fault sites by up to seven orders of magnitude. Yet this reduced fault site space accurately captures the error resilience profile of GPGPU applications.
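
The first pruning step described in the abstract, classifying threads by their dynamic instruction count and analyzing one representative per class, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy kernel (1024 threads, of which 24 execute 50 instructions and the rest 200), the choice of the lowest thread id as representative, and the assumption that every dynamic instruction writes one 32-bit destination register are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

REG_BITS = 32  # assumption: each dynamic instruction writes one 32-bit destination register


def total_fault_sites(icounts, reg_bits=REG_BITS):
    """Exhaustive space: one site per (thread, dynamic instruction, register bit)."""
    return sum(icounts.values()) * reg_bits


def group_by_icount(icounts):
    """Group thread ids by the number of dynamic instructions they execute."""
    groups = defaultdict(list)
    for tid, n_instr in icounts.items():
        groups[n_instr].append(tid)
    return dict(groups)


def pick_representatives(groups):
    """Keep one thread per group (the lowest id, an arbitrary choice)."""
    return {n_instr: min(tids) for n_instr, tids in groups.items()}


def pruned_fault_sites(representatives, reg_bits=REG_BITS):
    """Fault sites left when only the representative threads are injected."""
    return sum(representatives.keys()) * reg_bits


# Hypothetical kernel: threads 0..23 run 50 instructions, threads 24..1023 run 200.
icounts = {tid: (50 if tid < 24 else 200) for tid in range(1024)}

groups = group_by_icount(icounts)
reps = pick_representatives(groups)
full = total_fault_sites(icounts)
pruned = pruned_fault_sites(reps)

print(f"representatives: {reps}")
print(f"fault sites: {full} exhaustive, {pruned} after thread-level pruning")
```

In this toy setting the thread-level step alone shrinks the space from about 6.4 million sites to 8,000. The further steps listed in the abstract (code-block commonalities, loop iteration subsets, and register bit subsets) would reduce it more and are not modeled here.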

Bibliographic details

Source: International Symposium on Microarchitecture | 2018 | xxiv, 493 p. | 13 pages
Authors: Bin Nie; Lishan Yang; Adwait Jog; Evgenia Smirni
Original format: PDF
Chinese Library Classification: TP302-532
Keywords: Kernel; Resilience; Graphics processing units; Registers; Reliability; Message systems; Instruction sets


