Conference: International Conference for High Performance Computing, Networking, Storage and Analysis
Understanding the propagation of transient errors in HPC applications

Abstract

Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults propagate and at what rate they affect HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.
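The abstract does not say which machine learning technique or model form is used. As a rough illustration only, the sketch below assumes the simplest possible propagation model: the number of corrupted memory locations grows linearly with time since the fault. The function names `fit_linear` and `estimate_corrupted` and all sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def fit_linear(times, corrupted):
    """Ordinary least-squares fit of: corrupted ~ slope * time + intercept.

    The linear form is an assumption made for this sketch; the abstract
    does not state the model the authors actually derive.
    """
    t_bar, c_bar = mean(times), mean(corrupted)
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - t_bar) * (c - c_bar) for t, c in zip(times, corrupted))
    slope = sxy / sxx
    intercept = c_bar - slope * t_bar
    return slope, intercept


def estimate_corrupted(model, t):
    """Estimate the number of corrupted memory locations at time t."""
    slope, intercept = model
    # A count of corrupted locations cannot be negative.
    return max(0.0, slope * t + intercept)


# Synthetic observations invented for illustration (NOT results from the paper):
# time since the fault (arbitrary units) and corrupted locations observed.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
corrupted = [1, 3, 5, 7, 9]

model = fit_linear(times, corrupted)
print(model)
print(estimate_corrupted(model, 10.0))
```

In a setup like the one the abstract describes, the observations would presumably come from the instrumentation and runtime checker rather than from hand-written lists, and the fitted model would be queried at runtime to decide whether detection or recovery is worth triggering.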