IEEE International Test Conference

Soft error resiliency characterization and improvement on IBM BlueGene/Q processor using accelerated proton irradiation




Fault injection through accelerated irradiation is an effective way to evaluate the overall soft error resiliency of microprocessors. In this work, we report on irradiation experiments on a Blue Gene/Q (BG/Q) compute processor chip running selected applications. Blue Gene/Q is the third generation of IBM's massively parallel, energy-efficient Blue Gene series of supercomputers. In the experiments, we observed 69 code failures. Of these, 26 are relevant to calculating the mean time between failures (MTBF) of a 20 PetaFLOP, 96-rack system running a comparable workload mix. The expected MTBF for check-stops due to cosmic radiation and alpha particles from chip packaging materials is calculated to be 51 days at sea level in New York City when running the application mix studied. If the most vulnerable application is run exclusively, the projected MTBF is 35 days. These are outstanding results for a machine of this magnitude. The beam experiment and the projected MTBF validate the need for autonomous hardware detection and recovery, despite its cost in design effort, silicon area, and power.
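The projection from beam results to a system-level MTBF can be illustrated with a simplified sketch. It is not the authors' method: it assumes a single failure cross-section (failures divided by beam fluence) scaled by a natural particle flux and by the number of chips, and it ignores the alpha-particle contribution and any workload derating. The function names, the fluence, the flux, and the chip count below are all hypothetical placeholders; only the count of 26 relevant failures comes from the abstract.

```python
HOURS_PER_DAY = 24.0


def cross_section_cm2(failures: int, fluence_per_cm2: float) -> float:
    """Per-chip failure cross-section: observed failures / beam fluence."""
    if fluence_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return failures / fluence_per_cm2


def system_mtbf_days(
    failures: int,
    fluence_per_cm2: float,
    natural_flux_per_cm2_hr: float,
    n_chips: int,
) -> float:
    """Projected system MTBF in days under a simple linear scaling model."""
    if failures <= 0 or natural_flux_per_cm2_hr <= 0 or n_chips <= 0:
        raise ValueError("failures, flux and chip count must be positive")
    sigma = cross_section_cm2(failures, fluence_per_cm2)
    # Failure rate of one chip in the natural environment (per hour).
    chip_rate_per_hr = sigma * natural_flux_per_cm2_hr
    # Chips are assumed to fail independently, so rates add up.
    system_rate_per_hr = chip_rate_per_hr * n_chips
    return 1.0 / system_rate_per_hr / HOURS_PER_DAY


if __name__ == "__main__":
    # 26 relevant failures is from the abstract; every other number here
    # is an assumed placeholder, not a value reported in the paper.
    mtbf = system_mtbf_days(
        failures=26,
        fluence_per_cm2=1.0e12,        # hypothetical beam fluence
        natural_flux_per_cm2_hr=13.0,  # assumed sea-level flux
        n_chips=96 * 1024,             # assumed chips in a 96-rack system
    )
    print(f"Projected MTBF: {mtbf:.1f} days")
```

Under this model the MTBF is inversely proportional to the chip count, which is why a per-chip failure rate that looks negligible becomes a design concern at 96 racks.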


