Journal: IEEE Transactions on Parallel and Distributed Systems

Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Abstract

Algorithm-based fault tolerance (ABFT) is a technique that improves the reliability of a multiprocessor system by giving it concurrent error detection and fault location capability. It encodes data at the system level and modifies the algorithm to operate on the encoded data, so that both transient and permanent faults in any processor are exposed. Work in this area to date addresses only fault detection and location. However, if spare processors are not available, then once a faulty processor has been located, the work initially assigned to it must be remapped to nonfaulty processors in the system in such a way that the system's fault tolerance capability is maintained with as little performance degradation as possible. In this paper, we propose an integrated deterministic solution to this problem that combines concurrent error detection and fault location with graceful degradation. No previous deterministic ABFT method exists for designing general t-fault locating systems, even for the case t=1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model treats the processors that compute the checks as part of the ABFT system, so that faults in the check-computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be remapped to other nonfaulty processors in the system.
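To make the idea of operating on encoded data concrete, the sketch below shows the classic checksum encoding for matrix multiplication, which is a standard ABFT illustration and not the design method proposed in this paper. It assumes integer matrices (so checks can be compared exactly) and at most one corrupted data element; the function names are hypothetical and chosen only for this example.

```python
def column_checksum(a):
    """Return a copy of matrix a with an extra row holding each column's sum."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of matrix b with an extra column holding each row's sum."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_fault(c_full):
    """Check a full-checksum product matrix.

    Returns None if every row and column check passes, or (i, j) if exactly
    one row check and one column check fail, which points at data element
    (i, j). Any other pattern is outside this sketch's single-fault assumption.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data-element fault")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Multiplying the encoded operands yields a product that carries its own checks.
c_full = matmul(column_checksum(a), row_checksum(b))
clean_result = locate_fault(c_full)

# Simulate a fault in the processor that computed element (1, 0).
c_full[1][0] += 5
faulty_result = locate_fault(c_full)
```

In this scheme the intersection of the failing row check and the failing column check identifies the corrupted element. Remapping the faulty processor's work to the remaining processors, which is the graceful degradation problem the abstract addresses, is not covered by this sketch.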