Application-based Focused Recovery (ABFR): Convenient Management of Latent Error Resilience Using Application Knowledge

Abstract

Supercomputers continue to increase in scale and complexity to meet the demands of science and engineering. Exascale systems face high error rates due to increasing scale (10^9 cores), software complexity, and rising memory error rates. Increasingly, errors escape immediate hardware-level detection and silently corrupt application state. Such latent errors can often be detected by application-level tests, but typically only after long latencies. The challenges posed by latent errors include determining when the error occurred, what data was corrupted, and how to recover efficiently. The predicted high error rates and latent errors are a critical problem that will increase the cost, and may ultimately limit the scale, of application science. However, existing fault tolerance approaches lack support for latent errors, and there is no general guidance for designing latent error resilience.

This dissertation proposes a new approach called Application-Based Focused Recovery (ABFR) that lets high-performance applications execute efficiently in an environment with high error rates and latent errors. The approach exploits application knowledge to focus recovery on only the potentially corrupted data, achieving efficient and scalable latent error resilience. The two key ideas of ABFR are (1) clearly defining the application knowledge needed for latent error recovery (as embodied in the four ABFR operators), and (2) providing powerful runtime support that manages the complex recovery procedures using the four application operators, without any other application programmer effort.

ABFR is a well-defined resilience framework that allows the application to pursue strategies exploiting a range of application semantics. Application designers can express their knowledge flexibly in the four ABFR operators. ABFR is also an application-system partnership that provides a clear separation between application knowledge and the underlying system. Application designers implement the four operators without concern for the underlying architecture and system details. The ABFR runtime implements the complex recovery procedure, including triggering and composing the operators, exploiting parallelism, and achieving load balance. Together, these ABFR properties support flexible application-based resilience.

To demonstrate ABFR's generality, we apply it to three varied scientific computation archetypes (stencil, N-Body tree, and Monte Carlo particle transport). We design ABFR operators for each computation and measure latent error resilience performance for varied error rates. Results indicate that ABFR significantly improves recovery performance. Specifically, ABFR reduces error recovery cost by 2.4x to 367x, recovery latency by 2.2x to 24x, and I/O cost by up to 1000x. ABFR achieves efficient and scalable recovery at scale with high latent error rates for all three computation archetypes. Note that these results may be improved by more sophisticated application ABFR operators.

Overall, this dissertation demonstrates a new approach for efficient, scalable latent error recovery on large-scale systems. ABFR enables flexible application-based error resilience and provides sophisticated runtime support. As a result, applications are able to tolerate higher error rates and latent errors.
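The idea of focusing recovery on only potentially corrupted data can be illustrated with a toy example. The sketch below is not the dissertation's implementation and does not reproduce its four operators, which the abstract does not name. It assumes a 1D three-point averaging stencil with fixed boundaries, a clean saved version from `latency` steps before detection, and an error that can spread by at most one cell per step. All function names (`step`, `run`, `recompute_window`, `focused_recover`) are hypothetical.

```python
def step(u):
    """One update of a 3-point averaging stencil; the two end cells stay fixed."""
    n = len(u)
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, n - 1)]
    return [u[0]] + interior + [u[-1]]


def run(u0, steps, fault=None):
    """Return every state from step 0 to `steps`.

    fault = (step_index, cell, delta) silently corrupts one cell right after
    that step, standing in for a latent error (an assumption of this sketch).
    """
    states = [list(u0)]
    for k in range(1, steps + 1):
        u = step(states[-1])
        if fault is not None and fault[0] == k:
            u[fault[1]] += fault[2]
        states.append(u)
    return states


def recompute_window(checkpoint, lo, hi, steps):
    """Recompute cells lo..hi after `steps` updates, starting from a clean checkpoint.

    A halo of `steps` cells on each side is read so that the artificial edges
    of the sub-array cannot influence cells lo..hi.
    """
    n = len(checkpoint)
    a = max(0, lo - steps)
    b = min(n - 1, hi + steps)
    sub = checkpoint[a:b + 1]
    for _ in range(steps):
        sub = step(sub)
    return sub[lo - a:hi - a + 1]


def focused_recover(corrupted, checkpoint, detected_cell, latency):
    """Repair only the cells that the error could have reached.

    The error source lies within `latency` cells of the detection point and can
    spread another `latency` cells, so the window is detected_cell +/- 2*latency.
    """
    n = len(corrupted)
    lo = max(0, detected_cell - 2 * latency)
    hi = min(n - 1, detected_cell + 2 * latency)
    repaired = list(corrupted)
    repaired[lo:hi + 1] = recompute_window(checkpoint, lo, hi, latency)
    return repaired, (lo, hi)


# Toy scenario: 200 cells, 40 steps, error injected at step 36 in cell 100,
# detected at step 40 in cell 101, with an assumed latency bound of 6 steps.
n, total_steps, latency = 200, 40, 6
u0 = [float((i * 37) % 11) for i in range(n)]
clean = run(u0, total_steps)
faulty = run(u0, total_steps, fault=(36, 100, 5.0))

checkpoint = faulty[total_steps - latency]  # saved before the error occurred
repaired, window = focused_recover(faulty[-1], checkpoint, 101, latency)
print("recomputed cells:", window[1] - window[0] + 1, "of", n)
print("matches error-free run:", repaired == clean[-1])
```

In this toy case only the window around the detection point is recomputed, while a whole-domain rollback would recompute every cell. The latency bound and the one-cell-per-step propagation rule are the application knowledge here; other computations would need different rules.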

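Bibliographic record

  • Author

    Fang, Aiman.

  • Author affiliation

    The University of Chicago.

  • Degree-granting institution: The University of Chicago.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 126 p.
  • Total pages: 126
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Religion
  • Keywords

  • Date added to database: 2022-08-17 11:53:07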
