Conference: International Reliability Physics Symposium

Software-based dynamic reliability management for GPU applications

Abstract

In this paper, we propose a framework for dynamic reliability management (DRM) for GPU applications based on the idea of plug-n-play software-based reliability enhancement (SRE). The approach entails first assessing the vulnerability of GPU kernels to soft errors in program-visible structures. This assessment is performed on a low-level intermediate program representation rather than on the application source. Second, this assessment guides selective injection of code implementing SRE techniques to protect the most vulnerable data. Code injection occurs transparently at runtime using a just-in-time (JIT) compiler. Thus, reliability enhancement is selective, transparent, on-demand, and customizable. This flexible, automated, software-based DRM framework can provide an adaptable, cost-effective approach to scaling the reliability of large systems. We present the results of a proof-of-concept implementation on NVIDIA GPUs, demonstrating the ability to traverse a range of performance-reliability tradeoffs.
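
The selective step described in the abstract, protecting only the most vulnerable data, can be illustrated with a minimal sketch. The abstract does not give the vulnerability metric, the cost model, or the selection rule, so everything below is an assumption made for illustration: the `Candidate` record, the scores, the overhead figures, and the greedy rule in `select_for_protection` are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical IR-level value, e.g. a virtual register
    vulnerability: float  # assumed score: higher means more exposed to soft errors
    overhead: float       # assumed relative runtime cost of protecting this value


def select_for_protection(candidates, budget):
    """Greedily pick the most vulnerable values whose total overhead fits the budget."""
    chosen = []
    spent = 0.0
    # Visit candidates from most to least vulnerable (illustrative rule only).
    for c in sorted(candidates, key=lambda c: c.vulnerability, reverse=True):
        if spent + c.overhead <= budget:
            chosen.append(c.name)
            spent += c.overhead
    return chosen, spent


# Made-up example data; none of these numbers come from the paper.
candidates = [
    Candidate("r_acc", 0.9, 0.25),
    Candidate("r_idx", 0.6, 0.125),
    Candidate("r_tmp", 0.2, 0.125),
    Candidate("r_ptr", 0.8, 0.5),
]

# Sweeping the overhead budget moves along a performance-reliability tradeoff.
for budget in (0.0, 0.25, 0.5, 1.0):
    names, spent = select_for_protection(candidates, budget)
    print(budget, names, spent)
```

Raising the budget protects more values at a higher runtime cost, which is one simple way to picture traversing a range of performance-reliability tradeoffs. In a real system the selected values would then be handed to whatever SRE transformation the JIT compiler injects.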
