Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Methods


Abstract

Graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra. As the minimum feature size of transistors decreases, GPUs are becoming more vulnerable to transient faults caused by events such as alpha particle strikes, power fluctuations and electronic noise. In addition, the likelihood of a fault increases as more GPU computing nodes are used in supercomputers to meet the increasingly demanding computational requirements of scientific applications. Consequently, there are concerns that GPU-based supercomputer systems will suffer from very high fault rates. In order to ensure reliability, it is necessary to use fault tolerance (FT) techniques.

This thesis presents low-overhead FT techniques for several commonly-used linear algebraic applications that run on GPUs, focusing mainly on applications that operate with sparse matrices. These FT techniques exploit the invariant properties of the algorithms used in these applications, and exploit the parallel execution model of GPUs to allow for low-overhead error detection.

This thesis introduces and studies efficient error checking schemes for three popular matrix factorization techniques: Householder QR factorization, left-looking Cholesky factorization, and right-looking LU factorization. It also explores lightweight invariant checking methods for the preconditioned conjugate gradient (PCG) and biconjugate gradient stabilized (BiCGSTAB) iterative solvers and introduces an efficient checking method for the Lanczos eigensolver, as well as fault injection mechanisms for NVIDIA GPUs that allow for the simulation of transient, non-instantaneous faults.

This thesis carefully evaluates these FT methods on a contemporary NVIDIA GPU platform, and the results show that the aforementioned error checking strategies have high error coverage and are significantly more efficient than prior FT techniques on a GPU system.
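
To make the idea of an invariant check concrete, here is a minimal sketch for the plain (unpreconditioned) conjugate gradient method. It is a generic textbook-style illustration, not the scheme developed in the thesis: it runs on the CPU in pure Python, uses dense matrices, and checks the well-known invariant that the recursively updated residual r should equal b - A x. The function name `cg_with_check`, the `check_tol` threshold, and the `fault_at` injection hook are all assumptions introduced for this example.

```python
# Illustrative only: a generic residual-invariant check for conjugate gradient.
# This is NOT the thesis's method; names and thresholds are hypothetical.

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_with_check(A, b, tol=1e-10, check_tol=1e-8, max_iter=100, fault_at=None):
    """Solve A x = b for symmetric positive definite A with plain CG.

    After each update, verify the invariant r == b - A x (up to rounding).
    `fault_at` is a hypothetical hook: at that iteration, x is corrupted
    to mimic a transient fault. Returns (x, iterations, fault_detected).
    """
    x = [0.0] * len(b)
    r = list(b)              # r0 = b - A*0 = b
    p = list(r)
    rs = dot(r, r)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]

        if fault_at == k:
            x[0] += 1.0      # simulated corruption of the iterate

        # Invariant check: the recursive residual must match b - A x.
        true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        drift = sum((t - ri) ** 2 for t, ri in zip(true_r, r)) ** 0.5
        if drift > check_tol * b_norm:
            return x, k + 1, True

        rs_new = dot(r, r)
        if rs_new ** 0.5 <= tol * b_norm:
            return x, k + 1, False
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter, False

if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(cg_with_check(A, b))              # fault-free run
    print(cg_with_check(A, b, fault_at=0))  # run with an injected fault
```

Note that this sketch recomputes b - A x on every iteration, which adds a full matrix-vector product each time. A scheme aiming for low overhead would presumably check less often or rely on cheaper invariants; the abstract does not say which invariants the thesis uses.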

Bibliographic record

  • Author: Loh, Felix Da Yuan
  • Author affiliation: The University of Wisconsin - Madison
  • Degree-granting institution: The University of Wisconsin - Madison
  • Subject: Computer engineering; Electrical engineering; Computer science
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 150 p.
  • Original format: PDF
  • Language of text: eng