Conference: International Conference on Parallel and Distributed Computing

Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers


Abstract

Silent data corruptions (SDCs) are errors that corrupt the system or falsify results while remaining unnoticed by firmware or operating systems. In numerical integration solvers, SDCs that impact the accuracy of the solver are considered significant. Detecting SDCs in high-performance computing is necessary because results need to be trustworthy and because the growing number and complexity of components in emerging large-scale architectures make SDCs more likely to occur. Until recently, SDC detection methods consisted of replicating the processes of the execution or using checksums (for example, algorithm-based fault tolerance). Recently, new detection methods have been proposed that rely on mathematical properties of numerical kernels or perform data analysis of the results modified by the application. None of those methods, however, provides a lightweight solution guaranteeing that all significant SDCs are detected. We propose a new method called Hot Rod as a solution to this problem. It checks and potentially corrects the data produced by numerical integration solvers. Our theoretical model shows that all significant SDCs can be detected. We present two detectors and conduct experiments on streamline integration from the WRF meteorology application. Compared with the algorithmic detection methods, the accuracy of our first detector is increased by 52% with a similar false detection rate. The second detector has a false detection rate one order of magnitude lower than these detection methods while improving the detection accuracy by 23%. The computational overhead is lower than 5% in both cases. The model has been developed for an explicit Runge-Kutta method, although it can be generalized to other solvers.
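The abstract does not describe how the Hot Rod detectors work internally, so the following is only a generic illustration of the "check and potentially correct" idea for an explicit Runge-Kutta step, not the paper's method. It assumes a Heun (second-order) step with an embedded Euler solution, a fixed detection threshold, and a hypothetical `corrupt` hook that injects a fault into one stage; all names and values are chosen for illustration.

```python
import math

def heun_step(f, t, y, h, corrupt=0.0):
    """One Heun (RK2) step with an embedded Euler solution.

    `corrupt` is a hypothetical fault-injection hook: it is added to the
    second stage to imitate a silent data corruption.
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1) + corrupt
    y_high = y + 0.5 * h * (k1 + k2)    # second-order result
    y_low = y + h * k1                  # first-order (Euler) result
    return y_high, abs(y_high - y_low)  # result and local error estimate

def integrate(f, y0, t_end, h, threshold, faulty_step=None):
    """Fixed-step integration that flags and recomputes suspicious steps.

    The fixed `threshold` on the error estimate is an assumption made for
    this sketch; it is not taken from the paper.
    """
    t, y, flagged = 0.0, y0, []
    for i in range(round(t_end / h)):
        corrupt = 1.0 if i == faulty_step else 0.0
        y_new, err = heun_step(f, t, y, h, corrupt)
        if err > threshold:
            flagged.append(i)                   # detection
            y_new, err = heun_step(f, t, y, h)  # correction by recomputation
        y = y_new
        t += h
    return y, flagged

def decay(t, y):
    return -y  # test problem y' = -y, exact solution exp(-t)

clean_y, clean_flags = integrate(decay, 1.0, 1.0, 0.01, 1e-3)
fixed_y, fixed_flags = integrate(decay, 1.0, 1.0, 0.01, 1e-3, faulty_step=50)
print(clean_flags, fixed_flags, abs(fixed_y - math.exp(-1.0)))
```

In this toy setting the clean error estimate stays near 5e-5 while the injected fault pushes it to about 5e-3, so a fixed threshold separates the two. A real detector would need an adaptive bound, and a corruption small enough to stay under the threshold would pass unnoticed here.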
