2015 IEEE 21st International Symposium on High Performance Computer Architecture

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Abstract

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from domains such as astrophysics, fusion, climate, and combustion routinely use to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and from a GPU cluster at the Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratories, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.