Journal: IEEE Transactions on Parallel and Distributed Systems

Toward General Software Level Silent Data Corruption Detection for Parallel Applications


Abstract

Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. Mechanisms have been proposed that are able to detect SDC in HPC applications by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. In addition, we propose a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector (instead of just reporting overall single-bit precision and recall). In our experiments, we use four applications dealing with different explosions. Our results indicate that our new approach can protect the MPI applications analyzed with 7-70 percent less overhead (depending on the application) than that of full duplication with similar detection recall.
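The selection idea in the abstract (replicate only the processes whose data is hard for a smoothness-based detector to predict) can be illustrated with a small sketch. This is not the authors' algorithm: the linear-extrapolation predictor, the mean-error score, the `budget` parameter, and the function names are all assumptions chosen for illustration, and the sample data is made up.

```python
from statistics import mean


def prediction_errors(series):
    """Absolute errors of a linear-extrapolation predictor on one rank's data.

    Assumption: the detector predicts x[t] as 2*x[t-1] - x[t-2], i.e. it
    relies on the data being smooth in time.
    """
    return [
        abs(series[t] - (2 * series[t - 1] - series[t - 2]))
        for t in range(2, len(series))
    ]


def choose_ranks_to_replicate(data_by_rank, budget):
    """Return the `budget` ranks whose data is least predictable.

    A large mean prediction error means the lightweight detector would need
    a wide tolerance band on that rank, so more corruptions could pass
    unnoticed there. Those ranks are the candidates for replication.
    """
    score = {rank: mean(prediction_errors(s)) for rank, s in data_by_rank.items()}
    worst_first = sorted(score, key=score.get, reverse=True)
    return sorted(worst_first[:budget])


# Hypothetical per-rank time series (one monitored value per time step).
data_by_rank = {
    0: [1.0, 1.1, 1.2, 1.3, 1.4],    # smooth: easy to predict
    1: [1.0, 3.0, 0.5, 4.0, 0.2],    # highly variable
    2: [2.0, 2.0, 2.1, 2.1, 2.2],    # nearly smooth
    3: [0.0, 5.0, -5.0, 6.0, -6.0],  # highly variable
}

replicated = choose_ranks_to_replicate(data_by_rank, budget=2)
print(replicated)  # ranks 1 and 3 have the largest prediction errors
```

In this toy setup, the smooth ranks stay protected only by the lightweight detector, while the two most variable ranks are replicated. A real system would recompute the scores over time, since the abstract notes that processes do not all experience the same data variability at the same moment.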
