Iaso: an autonomous fault-tolerant management system for supercomputers

Abstract

As system scale increases, the inherent reliability of supercomputers declines. The cost of fault handling and task recovery grows so rapidly that reliability will soon limit the usability of supercomputers. This issue is referred to as the "reliability wall" and is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, for the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
Bibliographic record

  • Source
    Frontiers of computer science in China | 2014, Issue 3 | pp. 378-390 | 13 pages
  • Author affiliations

    Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China, College of Computer, National University of Defense Technology, Changsha 410073, China;

    Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China, College of Computer, National University of Defense Technology, Changsha 410073, China;

    College of Computer, National University of Defense Technology, Changsha 410073, China;

    College of Computer, National University of Defense Technology, Changsha 410073, China;

    College of Computer, National University of Defense Technology, Changsha 410073, China;

    College of Computer, National University of Defense Technology, Changsha 410073, China;

    College of Computer, National University of Defense Technology, Changsha 410073, China;

    College of Computer, National University of Defense Technology, Changsha 410073, China;

    ATR Laboratory, National University of Defense Technology, Changsha 410073, China;

  • Indexing information
  • Original format: PDF
  • Language of text: English
  • CLC classification
  • Keywords

    supercomputer; autonomous management; fault tolerant; fault management; MilkyWay-2 system;
