Conference: IEEE International Symposium on Parallel & Distributed Processing

A high-performance fault-tolerant software framework for memory on commodity GPUs


Abstract

As GPUs become more flexible and programmable and are increasingly used to accelerate HPC applications, their fault tolerance is becoming much more important than it was when they were used only for graphics. The current generation of GPUs, however, does not have standard error detection and correction capabilities, such as SEC-DED ECC for DRAM, which is almost always used in HPC servers. We present a high-performance software framework to enhance commodity off-the-shelf GPUs with DRAM fault tolerance. It combines data coding for detecting bit-flip errors with checkpointing for recovering computations when such errors are detected. We analyze the performance of data coding on GPUs and present optimizations geared toward memory-intensive GPU applications. We present performance studies of a prototype implementation of the framework and show that it can be realized with negligible overheads in compute-intensive applications such as the N-body problem and matrix multiplication, and with overheads as low as 35% in a highly efficient, memory-intensive 3-D FFT kernel.
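The combination of data coding and checkpointing described in the abstract can be illustrated with a small host-side sketch. The abstract does not say which code the framework uses, so the example below assumes a simple per-word parity bit, which detects any odd number of bit flips in a word but cannot correct them. All names here (`parity`, `encode`, `check`, `run_with_recovery`, `inject_fault`) are hypothetical and are not taken from the paper; a real implementation would run the coding inside GPU kernels.

```python
def parity(word: int) -> int:
    """Return the parity bit (0 or 1) of a 32-bit word."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


def encode(words):
    """Compute one parity bit per data word (kept alongside the data)."""
    return [parity(w) for w in words]


def check(words, codes):
    """Return the indices of words whose parity no longer matches."""
    return [i for i, (w, c) in enumerate(zip(words, codes)) if parity(w) != c]


def run_with_recovery(state, step, n_steps, checkpoint_interval=1,
                      inject_fault=None):
    """Apply `step` n_steps times with parity checks and rollback.

    Before each step the data is verified against its parity bits. If a
    mismatch is found, the computation restarts from the last checkpoint.
    `inject_fault` is an assumed test hook: (step_index, word_index, bit)
    flips one bit once, to simulate a DRAM error.
    """
    state = list(state)                 # do not modify the caller's data
    codes = encode(state)
    checkpoint = (0, list(state))       # (step index, saved copy of data)
    rollbacks = 0
    i = 0
    while i < n_steps:
        if inject_fault is not None and inject_fault[0] == i:
            _, idx, bit = inject_fault
            state[idx] ^= 1 << bit      # simulated bit flip in memory
            inject_fault = None         # the fault occurs only once
        if check(state, codes):         # error detected: roll back
            i, saved = checkpoint
            state = list(saved)
            codes = encode(state)
            rollbacks += 1
            continue
        state = step(state)             # one unit of computation
        codes = encode(state)           # re-encode the freshly written data
        i += 1
        if i % checkpoint_interval == 0:
            checkpoint = (i, list(state))
    return state, rollbacks
```

For example, with `step = lambda s: [(x + 1) & 0xFFFFFFFF for x in s]`, calling `run_with_recovery([1, 2, 3], step, 5, checkpoint_interval=2, inject_fault=(3, 1, 4))` detects the flipped bit, replays from the checkpoint taken at step 2, and still produces the fault-free result. The trade-off the sketch exposes is the one the abstract measures: every read must be checked and every write re-encoded, which costs little when arithmetic dominates and more when memory traffic does.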
