Conference: International Symposium on Microarchitecture

Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications


Abstract

Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general-purpose GPU (GPGPU) applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our proposed methodology is that GPGPU applications spawn a large number of threads, many of which execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can first be classified based on the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications.
Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. These steps reduce the number of fault sites by up to seven orders of magnitude, yet the reduced fault site space still accurately captures the error resilience profile of GPGPU applications.
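
The first pruning step described in the abstract, classifying threads by their dynamic instruction count and analyzing one representative per class, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy per-thread counts, and the assumption that every dynamic instruction writes one 32-bit destination register are all hypothetical, chosen only to show how grouping shrinks the fault site count.

```python
from collections import defaultdict

# Assumption (not from the paper): each dynamic instruction writes one
# 32-bit destination register, so it contributes 32 single-bit fault sites.
BITS_PER_REGISTER = 32


def group_threads_by_icnt(icnt_per_thread):
    """Map each dynamic instruction count to the thread ids that have it."""
    groups = defaultdict(list)
    for tid, icnt in enumerate(icnt_per_thread):
        groups[icnt].append(tid)
    return dict(groups)


def pick_representatives(groups):
    """Pick one thread per group; here, simply the lowest thread id."""
    return {icnt: min(tids) for icnt, tids in groups.items()}


def fault_sites(icnts, bits=BITS_PER_REGISTER):
    """Count fault sites as (dynamic instructions) x (register bits)."""
    return sum(icnts) * bits


# Hypothetical kernel: six threads run 100 dynamic instructions each,
# two threads (e.g. boundary threads) run 140 each.
icnt_per_thread = [100] * 6 + [140] * 2

groups = group_threads_by_icnt(icnt_per_thread)
representatives = pick_representatives(groups)

full_space = fault_sites(icnt_per_thread)            # every thread
pruned_space = fault_sites(representatives.keys())   # one thread per group

print(groups)
print(representatives)
print(full_space, pruned_space)
```

In this toy case the two representative threads stand in for all eight; the further steps the abstract lists (code-block commonality, loop-iteration sampling, and bit-position sampling) would reduce the remaining sites again and are not modeled here.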
