International Conference for High Performance Computing, Networking, Storage and Analysis

Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory

Abstract

Silent Data Corruption (SDC) can threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems. To characterize this issue, correctness field testing of HPC platforms at Los Alamos National Laboratory was performed. This work presents results for 12 platforms, including over 1,000 node-years of computation performed on over 8,750 compute nodes and over 260 petabytes of data transfers involving nearly 6,000 compute nodes, and relevant lessons learned. Incorrect results characteristic of transient errors and of intermittent errors were observed. These results are a key underpinning to resilience efforts as they provide signatures of incorrect results observed under field conditions. Five incorrect results consistent with a transient error mechanism were observed, suggesting that the effects of transient errors could be mitigated. However, the observed numbers of incorrect results consistent with an intermittent error mechanism suggest that intermittent errors could substantially affect computational correctness.