Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility



Abstract

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage, since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of operating GPUs efficiently at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.
