Conference: International Symposium on Reliable Distributed Systems
Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults

Abstract

Large-scale systems of linear equations are pervasive in many scientific fields, and great effort has been made over the last decade to develop efficient techniques for solving them, possibly relying on High Performance Computing (HPC) infrastructures to boost performance. In this context, the ever-growing scale of supercomputers inevitably increases the frequency of faults, making fault tolerance a crucial issue in HPC application development. A previous study [1] investigated the possibility of enhancing the Inhibition Method (IMe), a linear systems solver for dense unstructured matrices, with fault tolerance to single hard errors, i.e. failures that cause one computing processor to stop. This article extends [1] by proposing an efficient technique to obtain fault tolerance to multiple hard errors, which may occur concurrently on different processors belonging to the same or different machines. An improved parallel implementation is also proposed, which is particularly suitable for HPC environments and moves towards complete decentralization. The theoretical analysis suggests that the technique (which requires neither checkpointing nor rollback) is able to provide fault tolerance to multiple faults at the price of a small overhead and a limited number of additional processors to store the checksums. Experimental results on an HPC architecture validate the theoretical study, showing promising performance improvements with respect to a popular fault-tolerant solving technique.