Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Methods

Abstract

Graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra. As the minimum feature size of transistors decreases, GPUs are becoming more vulnerable to transient faults caused by events such as alpha particle strikes, power fluctuations and electronic noise. In addition, the likelihood of a fault increases as more GPU computing nodes are used in supercomputers to meet the increasingly demanding computational requirements of scientific applications. Consequently, there are concerns that GPU-based supercomputer systems will suffer from very high fault rates. In order to ensure reliability, it is necessary to use fault tolerance (FT) techniques.

This thesis presents low-overhead FT techniques for several commonly used linear algebraic applications that run on GPUs, focusing mainly on applications that operate with sparse matrices. These FT techniques exploit the invariant properties of the algorithms used in these applications, and exploit the parallel execution model of GPUs to allow for low-overhead error detection.

This thesis introduces and studies efficient error checking schemes for three popular matrix factorization techniques: Householder QR factorization, left-looking Cholesky factorization, and right-looking LU factorization. It also explores lightweight invariant checking methods for the preconditioned conjugate gradient (PCG) and biconjugate gradient stabilized (BiCGSTAB) iterative solvers and introduces an efficient checking method for the Lanczos eigensolver, as well as fault injection mechanisms for NVIDIA GPUs that allow for the simulation of transient, non-instantaneous faults.

This thesis carefully evaluates these FT methods on a contemporary NVIDIA GPU platform, and the results show that the aforementioned error checking strategies have high error coverage and are significantly more efficient than prior FT techniques on a GPU system.
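To make the idea of an invariant check concrete, the sketch below shows one textbook invariant of the (unpreconditioned) conjugate gradient method: the recursively updated residual r should stay equal to b - A x. This is a generic illustration in plain Python on a small dense matrix, not the checking scheme developed in the thesis, whose details the abstract does not give. The function name `cg_with_check`, the tolerances, and the `fault_at` injection hook are all hypothetical.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cg_with_check(A, b, tol=1e-10, max_iter=100, check_every=1,
                  check_tol=1e-8, fault_at=None):
    """Conjugate gradient for an SPD system with a residual invariant check.

    Invariant: the recursively updated residual r equals b - A x.
    fault_at is a hypothetical injection hook: at that iteration one entry
    of x is corrupted to emulate a transient fault.
    Returns (x, iterations_run, fault_detected).
    """
    x = [0.0] * len(b)
    r = list(b)                      # r = b - A x holds for x = 0
    p = list(r)
    rs = dot(r, r)
    b_norm = norm(b)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if fault_at == k:
            x[0] += 1.0              # injected error (illustration only)
        if k % check_every == 0:
            # Recompute the true residual and compare with the recursive one.
            true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            drift = norm([t - ri for t, ri in zip(true_r, r)])
            if drift > check_tol * b_norm:
                return x, k, True    # invariant violated: report a fault
        rs_new = dot(r, r)
        if math.sqrt(rs_new) <= tol * b_norm:
            return x, k, False
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter, False


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_ok, _, detected_ok = cg_with_check(A, b)
_, iters_bad, detected_bad = cg_with_check(A, b, fault_at=1)
```

Each check in this sketch costs one extra matrix-vector product, so `check_every` trades overhead against detection latency. A low-overhead scheme of the kind the abstract describes would presumably rely on cheaper invariants.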

Bibliographic record

Author: Loh, Felix Da Yuan
Author affiliation: The University of Wisconsin - Madison
Degree-granting institution: The University of Wisconsin - Madison
Subjects: Computer engineering; Electrical engineering; Computer science
Degree: Ph.D.
Year: 2018
Pages: 150
Original format: PDF
Language: English

