GS-DMR: Low-overhead soft error detection scheme for stencil-based computation

Abstract

Soft errors are becoming a prominent problem for massively parallel scientific applications. Dual-modular redundancy (DMR) can provide approximately 100% error coverage, but it incurs excessive overhead. Stencil kernels are among the most important routines in structured-grid computations. In this paper, we propose Grid Sampling DMR (GS-DMR), a low-overhead soft error detection scheme for stencil-based computation. Instead of comparing the whole result set as traditional DMR does, GS-DMR compares only a subset of the results, chosen by sampling the grid data according to the error propagation pattern on the grid. We also design a fault-tolerant (FT) framework that combines GS-DMR with checkpointing, and provide a theoretical analysis and an algorithm for choosing the optimal FT parameters. Experimental results on the Tianhe-2 supercomputer demonstrate that GS-DMR achieves a good FT effect for stencil-based computation, and that the benefit is considerably greater for massively parallel applications, reducing the total FT overhead by up to 51%.
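
The sampling idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the 1D 3-point averaging stencil, the initial data, the fixed sampling stride, and the additive fault model are all assumptions chosen for illustration, and the names `stencil_step`, `sampled_mismatch`, and `run_replicas` are hypothetical. The sketch only shows why comparing a sampled subset can still catch an error: a stencil spreads a corrupted value to its neighbours on every sweep, so the corruption eventually reaches a sampled grid point.

```python
def stencil_step(u):
    """One sweep of a 3-point averaging stencil; the two end values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def sampled_mismatch(a, b, stride):
    """Compare two replicas only at every `stride`-th grid point."""
    return any(a[i] != b[i] for i in range(0, len(a), stride))


def run_replicas(n, steps, stride, fault=None):
    """Advance two redundant replicas of the same computation.

    `fault` is an assumed fault model: (step, index, delta) adds `delta`
    to one grid point of replica B right after the given sweep.
    Returns the sweep at which the sampled comparison first sees a
    mismatch, or None if no mismatch is seen.
    """
    a = [float(i % 7) for i in range(n)]  # arbitrary initial data
    b = a[:]
    for step in range(steps):
        a = stencil_step(a)
        b = stencil_step(b)
        if fault is not None and fault[0] == step:
            b[fault[1]] += fault[2]  # inject the soft error into replica B
        if sampled_mismatch(a, b, stride):
            return step
    return None


if __name__ == "__main__":
    # stride=1 compares every point (traditional DMR); stride=8 compares 1/8 of them.
    print(run_replicas(64, 20, 1, fault=(3, 13, 1.0)))
    print(run_replicas(64, 20, 8, fault=(3, 13, 1.0)))
```

With these assumed parameters, full comparison (`stride=1`) flags the fault on the sweep where it is injected, while the sampled comparison (`stride=8`) flags it three sweeps later, once the corruption has spread from index 13 to the sampled index 16. The trade-off is fewer comparisons per sweep against a bounded detection delay; how the sampling should be chosen, and how it interacts with the checkpoint interval, is what the paper itself analyses.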

Bibliographic record

  • Source
    Parallel Computing | 2015, Issue 1 | pp. 50-65 | 16 pages
  • Author affiliations

    State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China;

    State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China;

    State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China;

    State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China;

    State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China;

    State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language: English
  • Keywords

    GS-DMR; Soft error; Stencil computation; Fault tolerant;
