Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery


Abstract

Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as the building blocks for an efficient fault-tolerant system for accelerators. Although we leverage our techniques to protect from detected but uncorrected ECC errors in the device memory in OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although optimal configurations depend on the particular application, the length of the run, the error rate, and the temporary storage speed, our test cases reveal a good balance with significantly reduced runtime overheads.
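
The tradeoff between runtime overhead and recovery time mentioned above can be illustrated with a small sketch. It does not reproduce the model used in the paper; it uses Young's first-order approximation for the checkpoint interval, a standard textbook result, and every number in it (checkpoint cost, mean time between errors) is a hypothetical value chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint (assumed constant).
    mtbf_s: mean time between detected, uncorrected errors (assumed known).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_waste_fraction(interval_s: float,
                            checkpoint_cost_s: float,
                            mtbf_s: float) -> float:
    """First-order fraction of run time lost to fault tolerance.

    The first term is the runtime overhead of taking checkpoints; the second
    is the expected rework after an error (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    cost = 2.0            # seconds per checkpoint
    mtbf = 24 * 3600.0    # one error per day on average

    best = young_interval(cost, mtbf)
    for interval in (best / 4, best, best * 4):
        waste = expected_waste_fraction(interval, cost, mtbf)
        print(f"interval={interval:9.1f} s  waste={waste:.4%}")
```

Checkpointing more often than the estimated interval raises the runtime overhead term, while checkpointing less often raises the expected rework term. This matches the dependence on error rate and temporary storage speed that the abstract describes, under the simplifying assumptions stated above.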
