Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults

Abstract

Large-scale linear equation systems are pervasive in many scientific fields, and great efforts have been made over the last decade to realize efficient techniques for solving them, often relying on High Performance Computing (HPC) infrastructures to boost performance. In this context, the ever-growing scale of supercomputers inevitably increases the frequency of faults, making fault tolerance a crucial issue in HPC application development. A previous study [1] investigated the possibility of enhancing the Inhibition Method (IMe), a linear systems solver for dense unstructured matrices, with fault tolerance to single hard errors, i.e. failures that cause one computing processor to stop. This article extends [1] by proposing an efficient technique to obtain fault tolerance to multiple hard errors, which may occur concurrently on different processors belonging to the same or different machines. An improved parallel implementation is also proposed, which is particularly suitable for HPC environments and moves towards complete decentralization. The theoretical analysis suggests that the technique (which requires neither checkpointing nor rollback) can provide fault tolerance to multiple faults at the price of a small overhead and a limited number of additional processors to store the checksums. Experimental results on an HPC architecture validate the theoretical study, showing promising performance improvements with respect to a popular fault-tolerant solving technique.
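The abstract does not spell out how the checksums are built, so the following is only a minimal sketch of a generic weighted-checksum scheme of the kind commonly used for recovery from multiple lost processors, not the IMe-specific technique of the paper. It assumes each processor holds one row of numbers, that the extra checksum processors store Vandermonde-weighted sums of those rows, and that up to as many rows as there are checksums may be lost at once. All function names (`weights`, `encode`, `solve`, `recover`) are hypothetical.

```python
from fractions import Fraction


def weights(num_checksums, num_procs):
    # Assumed weighting: row j holds (i + 1) ** j for processor i (Vandermonde).
    return [[Fraction(i + 1) ** j for i in range(num_procs)]
            for j in range(num_checksums)]


def encode(blocks, num_checksums):
    # One weighted sum of all data rows per checksum processor.
    w = weights(num_checksums, len(blocks))
    width = len(blocks[0])
    return [[sum(w[j][i] * blocks[i][k] for i in range(len(blocks)))
             for k in range(width)]
            for j in range(num_checksums)]


def solve(a, b):
    # Gauss-Jordan elimination in exact arithmetic; a must be square and invertible.
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def recover(blocks, checksums, failed):
    # blocks[i] is None for every failed processor index i in `failed`.
    f = len(failed)
    if f > len(checksums):
        raise ValueError("more failures than checksum processors")
    w = weights(f, len(blocks))
    a = [[w[j][i] for i in failed] for j in range(f)]
    survivors = [i for i in range(len(blocks)) if i not in failed]
    restored = {i: [] for i in failed}
    for k in range(len(checksums[0])):
        # Subtract the surviving rows' contribution, then solve for the lost entries.
        rhs = [checksums[j][k] - sum(w[j][i] * blocks[i][k] for i in survivors)
               for j in range(f)]
        for i, value in zip(failed, solve(a, rhs)):
            restored[i].append(value)
    return restored


data = [[1, 2], [3, 4], [5, 6], [7, 8]]   # one row per processor
sums = encode(data, 2)                     # two checksum processors
damaged = [data[0], None, data[2], None]   # processors 1 and 3 stop
print(recover(damaged, sums, [1, 3]))
```

In this sketch the lost rows are rebuilt directly from the surviving rows and the checksums, which illustrates why such a scheme needs no checkpoint or rollback; how the paper keeps the checksums consistent while the solver updates its data is not covered here.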


Bibliographic record

Source
International Symposium on Reliable Distributed Systems | 2020 | pp. 266-275 | 10 pages
Authors
Daniela Loreti; Marcello Artioli; Anna Ciampolini;
Original format: PDF
Keywords
Fault tolerance; multiple hard faults; High Performance Computing; linear equation systems solver; Inhibition Method;


Similar literature


1. Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators [J]. Quintana-Orti G, Igual FD, Quintana-Orti ES. ACM SIGPLAN Notices: A Monthly Publication of the Special Interest Group on Programming Languages. 2009, No. 4
2. Solution of linear systems of equations in the presence of two transient hardware faults [J]. Fitzpatrick P., Murphy C.C. IEE Proceedings. Part E. 1993, No. 5
3. High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors [J]. Peng Du, Piotr Luszczek, Jack Dongarra. Procedia Computer Science. 2012, No. 1
4. Solving dense linear systems on platforms with multiple hardware accelerators [C]. Gregorio Quintana-Orti, Francisco D. Igual, Enrique S. Quintana-Orti. ACM SIGPLAN symposium on Principles and practice of parallel programming. 2009
5. Efficient Linear Matrix Solver and Its Hardware Implementations Dedicated to Faster-Than-Real-Time Dynamic Simulation of Large Scale of Power System [D]. Wang, Zhao. 2018
6. Solving dynamical systems in neuromorphic hardware: simulation studies using balanced spiking networks [O]. Anna S Bulanova, Olivier Temam, Rodolphe Heliot. 2013
7. Solving dense linear systems on platforms with multiple hardware accelerators [O]. Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí. 2009
