Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling

Abstract

Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5\times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.


Bibliographic details

Source
IEEE International Parallel and Distributed Processing Symposium, 2021, pp. 289-298 (10 pages)
Authors
Fernando Fernandes dos Santos; Siva Kumar Sastry Hari; Pedro Martins Basso; Luigi Carro; Paolo Rech
Original format: PDF
Keywords
Performance evaluation; Particle beams; Tensors; Sensitivity; Graphics processing units; Predictive models; Tools
