Conference: IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale

Improving Application Resilience by Extending Error Correction with Contextual Information

Abstract

Extreme-scale systems are growing in scope and complexity as we approach exascale. Uncorrectable faults in such systems are also increasing, so resilience efforts that address them are of great importance. In this paper, we extend a method that augments hardware error detection and correction (EDAC) with contextual information, and show an application-based approach that takes detectable uncorrectable errors (DUEs) in data and corrects them. We applied this application-based method successfully to data errors found using common EDAC, and we discuss operating system changes that will make this possible on existing systems. We show that even when there are many acceptable correction choices (as may occur with floating-point data), a large percentage of DUEs are corrected, and even the miscorrected data are very close to correct. We developed two different contextual criteria for this application: local averaging and global conservation of mass. Both did well in terms of closeness, but conservation of mass outperformed averaging in terms of actual correctness. The contributions of this paper are: 1) the idea of application-specific EDAC-based contextual correction, 2) its successful demonstration on a real application, 3) the development of two different contextual criteria, and 4) a discussion of attainable changes to the OS kernel that make this possible on a real system.
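
To make the two contextual criteria concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes a DUE is a double-bit flip in a 64-bit floating-point word and, as a simplification, treats every two-bit flip of the corrupted word as a candidate correction (a real EDAC syndrome would narrow this set). The function names `candidates`, `correct_by_average`, and `correct_by_conservation`, and the sample 1-D field, are hypothetical.

```python
import math
import struct
from itertools import combinations


def float_to_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret a 64-bit integer pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def candidates(corrupted: float, nflips: int = 2):
    """Yield every finite value reachable by flipping `nflips` bits.

    Assumption: this brute-force set stands in for the candidate
    corrections that an EDAC syndrome would identify on real hardware.
    """
    bits = float_to_bits(corrupted)
    for positions in combinations(range(64), nflips):
        mask = 0
        for p in positions:
            mask |= 1 << p
        value = bits_to_float(bits ^ mask)
        if math.isfinite(value):
            yield value


def correct_by_average(corrupted: float, neighbors: list) -> float:
    """Local criterion: pick the candidate closest to the neighbor mean."""
    target = sum(neighbors) / len(neighbors)
    return min(candidates(corrupted), key=lambda v: abs(v - target))


def correct_by_conservation(corrupted: float, others_sum: float,
                            total_mass: float) -> float:
    """Global criterion: pick the candidate that best restores the total."""
    target = total_mass - others_sum
    return min(candidates(corrupted), key=lambda v: abs(v - target))


# Hypothetical 1-D field; values are dyadic so the sums below are exact.
cells = [1.0, 1.125, 1.5, 2.0, 2.75]
total_mass = sum(cells)  # known before the fault occurs

# Inject a double-bit error into cells[2] (one mantissa bit, one exponent bit).
bad = bits_to_float(float_to_bits(cells[2]) ^ (1 << 3) ^ (1 << 60))

avg_fix = correct_by_average(bad, [cells[1], cells[3]])
mass_fix = correct_by_conservation(bad, sum(cells[:2] + cells[3:]), total_mass)
print("corrupted:", bad)
print("averaging:", avg_fix)
print("conservation:", mass_fix)
```

In this toy field the neighbor mean differs from the true value, so the averaging criterion can settle on a nearby but wrong candidate, while the conservation criterion has an exact target. This mirrors the abstract's qualitative observation only; it says nothing about the paper's measured correction rates.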