Conference: IEEE International Symposium on Policies for Distributed Systems and Networks

Realization of User Level Fault Tolerant Policy Management through a Holistic Approach for Fault Correlation

Abstract

Many modern scientific applications, which are designed to utilize high-performance parallel computers, occupy hundreds of thousands of computational cores running for days or even weeks. Since many scientists compete for resources, most supercomputing centers practice strict scheduling policies and perform meticulous accounting of resource usage. Thus the computing resources and time assigned to a user are considered invaluable. However, most applications are not well prepared for unforeseeable faults and still rely on primitive fault tolerance techniques. Considering that the ever-plunging mean time to interrupt (MTTI) is making scientific applications more vulnerable to faults, it is increasingly important to provide users with not only an improved fault tolerant environment, but also a framework to support their own fault tolerance policies so that their allocation times can be best utilized. This paper addresses user-level fault tolerance policy management based on a holistic approach to digesting and correlating fault-related information. It introduces simple semantics with which users express their policies on faults, and illustrates how event correlation techniques can be applied to manage and determine the most preferable user policies. The paper also discusses an implementation of the framework using open source software, and demonstrates, as an example, how a molecular dynamics simulation application running on the institutional cluster at Oak Ridge National Laboratory benefits from it.