Conference: IEEE International Conference on Networking, Architecture and Storage

GPU-ABFT: Optimizing Algorithm-Based Fault Tolerance for Heterogeneous Systems with GPUs

Abstract

For matrix operations, algorithm-based fault tolerance (ABFT) incurs much lower overhead than traditional Triple Modular Redundancy or Double Modular Redundancy approaches. Much work has been done to develop and optimize ABFT schemes on general-purpose microprocessors. However, ABFT schemes for heterogeneous systems with GPUs are not yet fully developed or optimized. Moreover, existing ABFT schemes can correct computing errors introduced by logic units, but they cannot detect or correct many memory storage errors. In this work, we design a new ABFT scheme that protects both computation and memory storage, and we apply it to Cholesky decomposition on heterogeneous systems with GPUs. In addition, we develop several techniques for reducing fault tolerance overhead specifically on heterogeneous systems with GPU accelerators. Experimental results show that our ABFT scheme can correct both computing errors and memory storage errors with low overhead and comparable overall performance.
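
For readers unfamiliar with ABFT, the following is a minimal sketch of the classic checksum approach for matrix multiplication. It is not the scheme proposed in this paper: it covers only a single corrupted element of the product, does not model memory storage protection or Cholesky decomposition, and runs on the CPU in plain Python. All function names are hypothetical and chosen for illustration.

```python
# Classic checksum-based ABFT for C = A * B (illustrative only;
# an assumption-laden sketch, not the GPU-ABFT scheme from the paper).

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(cf, tol=1e-9):
    """Verify a full-checksum matrix and repair one corrupted data element.

    Returns the (row, col) that was corrected, or None if all checksums agree.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(cf[i][:m]) - cf[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(cf[i][j] for i in range(n)) - cf[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data element")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the element from the row checksum and the other entries in its row.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Multiplying the checksum-extended inputs yields a product whose last row
# and last column are checksums of the data block.
cf = matmul(add_column_checksum(a), add_row_checksum(b))

cf[0][1] += 10.0                 # inject a single computing error
fixed = locate_and_correct(cf)   # locate it via the mismatching row and column
```

The extra cost is one additional row and column in the product plus the checksum verification, which is the source of the overhead advantage over duplicating or triplicating the whole computation.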