From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators.

Abstract

This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7).

The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components.

This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
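The fault injection idea mentioned in the abstract can be illustrated with a small sketch. This is not the dissertation's tooling: the workload (`kernel`), the helper names (`flip_bit`, `inject_and_classify`, `campaign`), and the two-way outcome labels are all assumptions made for illustration. The sketch flips a single bit in one input value of a toy computation and compares the result against a fault-free ("golden") run. A flip that leaves the output unchanged is counted as masked; a flip that changes the output without raising any error is counted as silent data corruption (SDC).

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its float64 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


def kernel(data):
    """Hypothetical toy workload: count elements above a threshold."""
    return sum(1 for x in data if x > 0.5)


def inject_and_classify(data, index, bit):
    """Inject one bit flip into data[index] and classify the outcome."""
    golden = kernel(data)                  # fault-free reference output
    faulty = list(data)                    # leave the caller's data intact
    faulty[index] = flip_bit(faulty[index], bit)
    result = kernel(faulty)
    return "masked" if result == golden else "sdc"


def campaign(data, trials, seed=0):
    """Run `trials` random single-bit injections and tally the outcomes."""
    rng = random.Random(seed)
    counts = {"masked": 0, "sdc": 0}
    for _ in range(trials):
        index = rng.randrange(len(data))
        bit = rng.randrange(64)
        counts[inject_and_classify(data, index, bit)] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(42)
    inputs = [rng.random() for _ in range(1000)]
    print(campaign(inputs, trials=10_000))
```

In this toy setting most flips in low-order mantissa bits are masked, while flips in the sign or exponent bits can change the output without any visible error, which is the kind of silent corruption that an error detector would need to catch.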

Bibliographic record

  • Author

    Yim, Keun Soo

  • Author affiliation

    University of Illinois at Urbana-Champaign

  • Degree-granting institution: University of Illinois at Urbana-Champaign
  • Subject: Computer Science; Engineering Aerospace; Statistics
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 220 p.
  • Original format: PDF
  • Language: eng
