Journal: IEEE Transactions on Nuclear Science

Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications

Abstract

In this paper, we investigate neutron-induced errors in three implementations of sort algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large server applications. We measure the radiation-induced error rate of the sort algorithms using the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array, misaligned values, application crashes, and system hangs. Our results show that the criticality of a radiation-induced output error pattern depends on the application. Additionally, we performed an extensive fault-injection campaign to better understand the observed phenomena. We use the SASS-assembly Instrumentor Fault Injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we use our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We identify the error patterns that are critical to the final application and find the most efficient way to detect them. With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
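The output error patterns named in the abstract (wrong elements in the sorted array and misaligned values, read here as values that end up out of order) can be told apart by a simple post-sort check. The following is a minimal Python sketch under stated assumptions, not the authors' hardening strategy: the function name classify_sort_output and the three labels are hypothetical, and the check runs on the host over plain lists rather than on the GPU.

```python
from collections import Counter


def classify_sort_output(original, output):
    """Label how `output` deviates from a correct ascending sort of `original`.

    Hypothetical helper for illustration only. Returns a set of labels;
    an empty set means no deviation was detected.
    """
    labels = set()

    # Elements were lost or duplicated in a way that changes the array size.
    if len(output) != len(original):
        labels.add("length_mismatch")

    # Order check: some adjacent pair decreases, so values are misplaced.
    if any(a > b for a, b in zip(output, output[1:])):
        labels.add("out_of_order")

    # Content check: the output must be a permutation of the input.
    # A corrupted value shows up as a difference in element counts.
    if Counter(output) != Counter(original):
        labels.add("wrong_elements")

    return labels


if __name__ == "__main__":
    data = [3, 1, 2]
    print(classify_sort_output(data, [1, 2, 3]))  # correct sort
    print(classify_sort_output(data, [1, 3, 2]))  # values swapped
    print(classify_sort_output(data, [1, 2, 7]))  # one value corrupted
```

The order check is a single linear pass, while the content check needs the original input and extra memory. Which of the two is worth paying for depends on which pattern is critical to the application consuming the sorted array, which is the trade-off the abstract points to.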