Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

Abstract

The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint, as most of its power came from 18,688 GPUs that had to go through three rework cycles, two on the GPU mechanical assembly and one on the GPU circuit boards. We write about the last rework cycle and a reliability analysis of over 100,000 years of GPU lifetimes during Titan's 6-year-long productive period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability is dependent on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.
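
The abstract mentions statistical survival analysis without naming a specific estimator. As a rough illustration only, the sketch below computes a Kaplan-Meier survival curve, one common technique of this kind; it is an assumption that this resembles what the authors used. The function name `kaplan_meier` and the lifetimes are hypothetical, synthetic values and are not taken from the Titan data set.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: time until failure or until observation ended (censoring).
    observed:  True if the unit failed, False if it was censored.
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Product-limit step: fraction of at-risk units surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Synthetic, made-up lifetimes in years (NOT Titan data).
durations = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t:.1f} years  S(t)={s:.4f}")
```

Censored units (here, GPUs still working when observation stopped) stay in the risk set until their last observed time, which is what lets the estimator use incomplete lifetimes instead of discarding them.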
