IEEE Transactions on Parallel and Distributed Systems

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications




For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because nothing indicates that errors have occurred during execution. We propose an adaptive impact-driven method that detects SDCs dynamically. The key contributions are threefold. (1) We characterize 18 HPC applications/benchmarks and discuss their runtime data features, as well as the impact of SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve prediction accuracy but instead detects only influential SDCs, so as to guarantee user-acceptable execution results. (3) Our solution adapts to dynamic prediction errors based on local runtime data and automatically tunes detection ranges to keep false alarms low. Experiments show that our detector detects 80-99.99 percent of SDCs with a false alarm rate of less than 1 percent of iterations in most cases. The memory cost and detection overhead are reduced to 15 and 6.3 percent, respectively, for a large majority of applications.
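To make the idea of an impact-driven, self-tuning detection range more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the class name `AdaptiveRangeDetector`, the linear-extrapolation predictor, the `impact_bound` parameter, and the rule for widening the range are all assumptions chosen for illustration. The sketch tracks a single data point across iterations, predicts its next value from the last two accepted values, and flags a value only when it deviates from the prediction by more than the user-acceptable bound plus the largest prediction error seen so far.

```python
from collections import deque


class AdaptiveRangeDetector:
    """Illustrative range-based SDC detector for one data point over iterations.

    Hypothetical sketch: the predictor and the range-update rule are
    assumptions, not the method described in the paper.
    """

    def __init__(self, impact_bound):
        # Deviation the user is willing to tolerate (assumed to be given).
        self.impact_bound = impact_bound
        # Last two accepted values, used for linear extrapolation.
        self.history = deque(maxlen=2)
        # Largest prediction error observed on accepted values.
        self.max_pred_error = 0.0

    def predict(self):
        """Linearly extrapolate the next value, or None if history is short."""
        if len(self.history) < 2:
            return None
        prev, last = self.history
        return 2 * last - prev

    def radius(self):
        """Half-width of the current detection range around the prediction."""
        return self.impact_bound + self.max_pred_error

    def check(self, value):
        """Return True if value looks corrupted; otherwise accept it."""
        predicted = self.predict()
        if predicted is not None and abs(value - predicted) > self.radius():
            return True  # suspicious: do not learn from this value
        self._accept(value, predicted)
        return False

    def report_false_alarm(self, value):
        """A flagged value turned out to be correct: widen the range to cover it."""
        self._accept(value, self.predict())

    def _accept(self, value, predicted):
        if predicted is not None:
            error = abs(value - predicted)
            self.max_pred_error = max(self.max_pred_error, error)
        self.history.append(value)


if __name__ == "__main__":
    detector = AdaptiveRangeDetector(impact_bound=0.5)
    series = [1.0, 2.0, 3.0, 4.0, 100.0, 5.0]  # 100.0 mimics a corrupted value
    print([detector.check(v) for v in series])
    # prints [False, False, False, False, True, False]
```

In this sketch, deviations smaller than `impact_bound` are never flagged, which mirrors the idea of detecting only influential corruptions, and `report_false_alarm` enlarges the range after a verified false alarm so that similar values are accepted later. A real detector would run one such check per data point and would need a way to verify suspected values, which is outside the scope of this example.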




