IEEE International Parallel and Distributed Processing Symposium

Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling


Abstract

Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU's computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5\times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.
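The "differences lower than $5\times$" criterion can be read as a bound on the fold difference between a predicted rate and a measured rate. The following is a minimal sketch of that check, assuming a symmetric fold difference (the larger rate divided by the smaller one); the abstract does not state the exact metric, so this definition, the function names `fold_difference` and `within_factor`, and the example numbers are all hypothetical illustrations rather than the authors' method or data.

```python
def fold_difference(predicted: float, measured: float) -> float:
    """Symmetric fold difference (always >= 1) between two positive rates.

    Assumption: "difference" means larger rate / smaller rate. The abstract
    does not define the metric, so this is one plausible reading.
    """
    if predicted <= 0 or measured <= 0:
        raise ValueError("rates must be positive")
    ratio = predicted / measured
    return ratio if ratio >= 1.0 else 1.0 / ratio


def within_factor(predicted: float, measured: float, factor: float = 5.0) -> bool:
    """True if the prediction is within `factor` of the measurement."""
    return fold_difference(predicted, measured) < factor


# Placeholder rates in arbitrary units (NOT values from the paper).
print(within_factor(2.0, 0.5))   # fold difference is 4.0
print(within_factor(1.0, 10.0))  # fold difference is 10.0
```

Using a symmetric ratio treats over-prediction and under-prediction alike; a signed or one-sided comparison would be an equally reasonable choice if the direction of the error matters.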
