IEEE International Conference on Parallel and Distributed Systems

Extending checksum-based ABFT to tolerate soft errors online in iterative methods

Abstract

As the size and complexity of high performance computers increase, more soft errors will be encountered during computations. Algorithm-Based Fault Tolerance (ABFT) has proven to be a highly efficient technique for detecting soft errors in dense linear algebra operations, including matrix multiplication and Cholesky and LU factorization. While ABFT can also be applied to an iterative sparse linear algebra algorithm by applying it to every individual matrix-vector multiplication in the algorithm, doing so often introduces considerable overhead. In this paper, we propose novel extensions to ABFT that not only reduce this overhead but also protect computations that existing ABFT cannot protect. Instead of maintaining checksums in every individual matrix-vector multiplication, we modify the algorithms so that checksums established at the beginning of an algorithm are maintained at every iteration throughout the algorithm. Because soft errors in most iterative sparse linear algebra algorithms propagate from one iteration to the next, we do not have to verify the correctness of the checksums at every iteration to detect errors. Reducing the frequency of verification greatly reduces the fault tolerance overhead. Experimental results demonstrate that, when used together with local diskless checkpoints, our approach introduces much less overhead than existing ABFT techniques.
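The idea of verifying a checksum only every few iterations can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a plain Richardson iteration as a stand-in for the iterative method, a dense list-of-lists matrix, and an all-ones checksum vector. The names `richardson_lazy_abft`, `check_every`, and `fault` are hypothetical. The invariant r = b - A x implies sum(r) = sum(b) - (column sums of A) · x, and an error injected into r keeps violating that relation in later iterations, so a check performed some iterations afterwards can still catch it.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def richardson_lazy_abft(A, b, omega, iters, check_every, fault=None, tol=1e-8):
    """Richardson iteration with a checksum verified every `check_every` steps.

    fault: optional (iteration, index, delta) tuple, a hypothetical soft error
           added to the residual vector r (assumption for illustration only).
    Returns (x, detected_at); detected_at is the iteration at which the
    checksum test failed, or None if no mismatch was seen.
    """
    n = len(b)
    # Checksums established once, before iterating: c^T A and c^T b with c = ones.
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    sum_b = sum(b)

    x = [0.0] * n
    r = list(b)  # r = b - A x holds for x = 0
    for k in range(1, iters + 1):
        Ar = matvec(A, r)
        x = [xi + omega * ri for xi, ri in zip(x, r)]
        r = [ri - omega * ari for ri, ari in zip(r, Ar)]

        if fault is not None and fault[0] == k:
            r[fault[1]] += fault[2]  # inject the simulated soft error

        # Lazy verification: an O(n) check only every `check_every` iterations.
        if k % check_every == 0:
            expected = sum_b - sum(cs * xi for cs, xi in zip(col_sums, x))
            if abs(sum(r) - expected) > tol * max(1.0, abs(sum_b)):
                return x, k
    return x, None


A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x_clean, clean_flag = richardson_lazy_abft(A, b, 0.2, 60, 5)
x_bad, bad_flag = richardson_lazy_abft(A, b, 0.2, 60, 5, fault=(3, 1, 1.0))
```

In this sketch the fault is injected at iteration 3 but verification only runs at multiples of 5, so detection relies on the error persisting in the invariant. On detection, a real implementation would roll back to a checkpoint; that recovery step is omitted here.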
