Journal: Concurrency and Computation

Application health monitoring for extreme-scale resiliency using cooperative fault management


Abstract

Resiliency is, and will remain, a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications that are oblivious to, and incapable of handling, transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology treats application output that follows known patterns as an indicator of the application's health, on the premise that a violation of these patterns could indicate a fault. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.
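The detect-then-corroborate idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the monitored quantity (a value expected to stay near its initial level, such as a conserved energy in a molecular dynamics run), the relative tolerance, the function names `first_violation` and `decide`, and the action labels `continue`, `recheck`, and `rollback` are all hypothetical. The Fault Tolerance Backplane itself is not modeled; system-monitor alerts are stood in for by a plain set of step numbers.

```python
from typing import Optional, Sequence, Set


def first_violation(samples: Sequence[float], rel_tol: float) -> Optional[int]:
    """Return the index of the first sample that breaks the expected pattern.

    Hypothetical pattern: every sample stays within rel_tol (relative) of
    the first sample. Returns None when the pattern holds throughout.
    """
    if not samples:
        return None
    baseline = samples[0]
    scale = max(abs(baseline), 1e-12)  # avoid dividing by zero
    for step, value in enumerate(samples):
        is_nan = value != value  # NaN is the only float unequal to itself
        if is_nan or abs(value - baseline) / scale > rel_tol:
            return step
    return None


def decide(samples: Sequence[float],
           monitor_alert_steps: Set[int],
           rel_tol: float = 1e-3) -> str:
    """Toy fault-coordinator decision (action names are illustrative).

    - "continue": the output pattern holds.
    - "rollback": the pattern is violated and a system monitor reported
      a problem at the same step, so the fault is corroborated.
    - "recheck":  the pattern is violated but nothing corroborates it.
    """
    step = first_violation(samples, rel_tol)
    if step is None:
        return "continue"
    if step in monitor_alert_steps:
        return "rollback"
    return "recheck"


if __name__ == "__main__":
    energies = [-100.0, -100.01, -100.02, -57.3]  # made-up values
    print(decide(energies, monitor_alert_steps={3}))
    print(decide(energies, monitor_alert_steps=set()))
```

Keeping detection (`first_violation`) separate from the decision (`decide`) mirrors the split the abstract describes between application health indicators and the coordinator that acts on them. A real coordinator would steer the application between checkpoints instead of returning a string.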
