Run-Time Recovery Mechanism for Transient and Permanent Hardware Faults Based on Distributed, Self-Organized Dynamic Partially Reconfigurable Systems

Victor Dumitriu; Lev Kirischian; Valeri Kirischian


Abstract

Field-Programmable Gate Arrays (FPGAs) are rapidly gaining popularity as implementation platforms for complex space-borne computing systems. However, such systems are exposed to cosmic radiation at levels orders of magnitude higher than terrestrial levels, which can cause transient and even permanent hardware faults in on-board computing platforms. Because of this, the development of effective fault mitigation methods and self-repair mechanisms has become vital for FPGA-based space-borne computing platforms. This work presents a novel method for transient and permanent fault mitigation and run-time fault recovery for commercial-grade FPGA devices with partially reconfigurable tile-based architectures. The proposed method ensures the same pre-determined recovery time for transient and permanent hardware faults through dynamic on-chip component relocation, regardless of the fault type. The method makes use of fully distributed control, communication, self-synchronization and self-integration mechanisms embedded in each on-chip hardware component. Run-time collaboration between components provides the relocation and fault mitigation procedures. The distributed nature of these mechanisms excludes most central failure points that could cause non-restorable system faults. The method has been implemented, tested and verified on a Xilinx Kintex-7 FPGA platform. Results show that the proposed method is significantly more resource efficient than Triple-Module Redundancy or central, software-based control mechanisms.
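The idea of handling both fault types through the same relocation step can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' mechanism: the abstract describes fully distributed control embedded in each component, whereas this example uses a single in-memory object for brevity. The names `TileArray`, `Fault`, `place` and `recover` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Fault(Enum):
    TRANSIENT = "transient"  # assumed clearable, so the tile can be reused later
    PERMANENT = "permanent"  # assumed physical damage, so the tile is retired


@dataclass
class TileArray:
    """Toy model of a tile-based fabric (hypothetical, centralized for brevity)."""
    num_tiles: int
    placement: dict = field(default_factory=dict)  # component name -> tile index
    retired: set = field(default_factory=set)      # tiles with permanent faults

    def free_tiles(self):
        # A tile is free if no component occupies it and it is not retired.
        used = set(self.placement.values())
        return [t for t in range(self.num_tiles)
                if t not in used and t not in self.retired]

    def place(self, component, tile):
        if tile not in self.free_tiles():
            raise ValueError(f"tile {tile} is not available")
        self.placement[component] = tile

    def recover(self, component, fault):
        """Relocate `component` to a spare tile; the steps are the same for both fault types."""
        faulty_tile = self.placement[component]
        spares = self.free_tiles()
        if not spares:
            raise RuntimeError("no spare tile left for relocation")
        self.placement[component] = spares[0]      # relocation step
        if fault is Fault.PERMANENT:
            self.retired.add(faulty_tile)          # never reuse this tile
        # After a transient fault the old tile simply returns to the spare pool.
        return spares[0]


fabric = TileArray(num_tiles=4)
fabric.place("filter", 0)
fabric.place("fft", 1)
fabric.recover("filter", Fault.TRANSIENT)  # filter moves to tile 2, tile 0 is spare again
fabric.recover("fft", Fault.PERMANENT)     # fft moves to tile 0, tile 1 is retired
```

In this sketch the only difference between the two fault types is the bookkeeping after relocation, which loosely mirrors the abstract's claim that recovery follows the same path regardless of fault type. It says nothing about actual recovery times or resource usage.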


Bibliographic Record

Source
IEEE Transactions on Computers | 2016, Issue 9 | pp. 2835-2847 | 13 pages
Authors
Victor Dumitriu; Lev Kirischian; Valeri Kirischian
Affiliation

Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada

Original format: PDF
Language: English
Keywords
Fault tolerance; field programmable gate arrays; reconfigurable architectures; system-on-chip

