Conference: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis

Detection and correction of silent data corruption for large-scale high-performance computing

Abstract

Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications, and investigates the challenges inherent in detecting soft errors within MPI applications by providing transparent MPI redundancy. Assuming a model in which corruption in application data manifests itself as differing MPI messages between replicas, we study the best-suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on an application by causing a cascading pattern of corruption that, in most cases, spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
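The model described in the abstract, in which corruption shows up as differing messages between replicas, can be illustrated with a small sketch. This is not the paper's protocol and uses no real MPI calls: the replicas' outgoing messages are plain byte strings, and the names `flip_bit`, `detect_mismatch`, and `vote` are hypothetical helpers chosen for illustration. It assumes that two replicas are enough to detect a mismatch, while a strict majority (for example two out of three) is needed to pick a corrected message.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Return a copy of payload with one bit inverted (a simulated soft error)."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def digest(payload: bytes) -> str:
    """Hash a message so replicas could compare digests instead of full payloads."""
    return hashlib.sha256(payload).hexdigest()


def detect_mismatch(replica_messages: Sequence[bytes]) -> bool:
    """Return True if the replicas did not all produce the same message."""
    return len({digest(m) for m in replica_messages}) > 1


def vote(replica_messages: Sequence[bytes]) -> Optional[bytes]:
    """Return the message produced by a strict majority of replicas, else None."""
    if not replica_messages:
        return None
    counts = Counter(digest(m) for m in replica_messages)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(replica_messages):
        return None  # no strict majority: corruption is detected but not correctable
    return next(m for m in replica_messages if digest(m) == winner)
```

With two replicas, `detect_mismatch([good, bad])` reports the disagreement but `vote` returns `None`, since neither copy can be trusted over the other. With three replicas, `vote([good, bad, good])` returns the uncorrupted message.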
