Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility


Abstract

The high computational capability of graphics processing units (GPUs) is enabling and driving scientific discovery at large scale. Titan, the world's second-fastest supercomputer for open science, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still at a nascent stage, since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in operating GPUs efficiently at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.


Bibliographic details

Source
International Conference for High Performance Computing, Networking, Storage and Analysis | 2015 | pp. 1-12 | 12 pages
Authors
Devesh Tiwari; Saurabh Gupta; George Gallarno; Jim Rogers; Don Maxwell;
Original format: PDF
Keywords
Graphics processing units; Supercomputers; Computer architecture; Error correction codes; Hardware; Instruction sets; Reliability;


