Conference: 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

Abstract

Graphics processing units (GPUs) are gaining widespread use in high-performance computing because of their performance advantages relative to CPUs. However, the reliability of GPUs is largely unproven. In particular, current GPUs lack error checking and correcting (ECC) in their memory subsystems. The impact of this design has not been previously measured at a large enough scale to quantify soft error events. We present MemtestG80, our software for assessing memory error rates on NVIDIA graphics cards. Furthermore, we present a large-scale assessment of GPU error rates, conducted by running MemtestG80 on over 50,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We show that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
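
To illustrate what a pattern-based memory test looks like in general, here is a minimal host-side sketch in Python. It is not MemtestG80's implementation, which the abstract does not describe. The names `pattern_test`, `flip_bit`, and the `disturb` hook are hypothetical, and a plain Python list stands in for device memory. The idea is to write a known bit pattern and then its complement, read each back, and count the words that differ.

```python
from typing import Callable, List, Optional

WORD_MASK = 0xFFFFFFFF  # treat each list element as a 32-bit word


def pattern_test(
    memory: List[int],
    pattern: int,
    disturb: Optional[Callable[[List[int]], None]] = None,
) -> int:
    """Write `pattern`, then its bitwise complement, to every word.

    After each write pass the buffer is read back and compared against
    the expected value. Returns the total number of mismatched words.
    `disturb` is a hypothetical hook that simulates a soft error
    occurring between the write and the read-back.
    """
    errors = 0
    for value in (pattern & WORD_MASK, ~pattern & WORD_MASK):
        for i in range(len(memory)):
            memory[i] = value
        if disturb is not None:
            disturb(memory)
        errors += sum(1 for word in memory if word != value)
    return errors


def flip_bit(memory: List[int]) -> None:
    """Simulated soft error: flip bit 7 of word 3 (assumed fault model)."""
    memory[3] ^= 1 << 7


if __name__ == "__main__":
    buffer = [0] * 64
    print("clean run errors:", pattern_test(buffer, 0xAAAAAAAA))
    print("faulty run errors:", pattern_test(buffer, 0xAAAAAAAA, flip_bit))
```

Testing both a pattern and its complement exercises every bit in both states. A test that reports errors for some patterns but not others is what "pattern-sensitive" refers to in this sketch's terms.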