Conference proceedings: Parallel and Distributed Computing and Networks

THE CASE FOR MODULAR REDUNDANCY IN LARGE-SCALE HIGH PERFORMANCE COMPUTING SYSTEMS

Abstract

Recent investigations into the resilience of large-scale high-performance computing (HPC) systems have shown a continuing trend of decreasing reliability and availability. Newly installed systems have a lower mean time to failure (MTTF) and a higher mean time to recover (MTTR) than their predecessors. Modular redundancy is used in many mission-critical systems today, such as aerospace and command and control systems, to provide resilience. The primary argument against modular redundancy for resilience in HPC has always been that the capability of an HPC system, and the respective return on investment, would be significantly reduced. We argue that modular redundancy can significantly increase compute node availability because it removes the impact of scale from single compute node MTTR. We further argue that single compute nodes can be much less reliable, and therefore less expensive, and still be highly available if their MTTR/MTTF ratio is maintained.
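
The ratio argument in the abstract can be illustrated with the textbook steady-state availability formula, A = MTTF / (MTTF + MTTR). The sketch below is not taken from the paper: the function names, the example figures, and the assumption that redundant nodes fail independently are all illustrative. It shows that a node with a tenth of the MTTF keeps the same availability when its MTTR shrinks by the same factor, and that pairing two such nodes raises availability further.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(node_availability: float, replicas: int) -> float:
    """Availability of `replicas` identical nodes where one survivor suffices.

    Assumes independent failures (an illustrative simplification).
    """
    return 1.0 - (1.0 - node_availability) ** replicas


# Hypothetical figures: a reliable node and a node that is ten times less
# reliable but has the same MTTR/MTTF ratio.
reliable_node = availability(mttf_hours=1000.0, mttr_hours=10.0)
cheap_node = availability(mttf_hours=100.0, mttr_hours=1.0)

# Same ratio, same availability.
assert math.isclose(reliable_node, cheap_node)

# Dual modular redundancy built from the less reliable nodes.
cheap_pair = redundant_availability(cheap_node, replicas=2)
print(f"single node: {cheap_node:.6f}, redundant pair: {cheap_pair:.6f}")
```

Under these assumptions, availability depends only on the MTTR/MTTF ratio, which is why a less reliable node can remain highly available as long as it recovers proportionally faster.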