Journal: IEEE Transactions on Computers

Algorithm-based fault location and recovery for matrix computations on multiprocessor systems


Abstract

Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single and double fault location and recovery cases.
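To make the idea of operating on encoded data concrete, the following is a minimal sketch of the well-known checksum-matrix encoding for matrix multiplication. It is an illustration only, not the fault-location and recovery scheme of this paper: it assumes a fault corrupts a single element of the result, whereas the paper's fault model lets a faulty processor corrupt all the data it holds. The helper names (`add_column_checksum`, `add_row_checksum`, `locate_and_correct`) are hypothetical, and integer matrices are used so that checksum comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksum(a):
    """Append a row holding the sum of each column (column-checksum encoding)."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append a column holding the sum of each row (row-checksum encoding)."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(cf):
    """Check a full-checksum matrix and repair one corrupted data element in place.

    Returns None if all checksums agree, or the (row, col) that was repaired.
    Raises ValueError when the mismatch pattern is not a single data-element error.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m) if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data-element error")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the element from its row checksum and the other entries in that row.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
expected = matmul(a, b)

# Multiply the encoded operands; the result carries row and column checksums.
cf = matmul(add_column_checksum(a), add_row_checksum(b))

cf[0][1] += 99  # simulate a fault corrupting one element of the result
repaired_at = locate_and_correct(cf)
product = [row[:2] for row in cf[:2]]  # strip the checksum row and column
```

The intersection of the inconsistent row and the inconsistent column locates the error, and the surviving checksum supplies the redundancy needed to recover the value. Handling whole-processor faults, as the abstract describes, needs more redundancy than this single-element sketch provides.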
