Journal: IEEE Transactions on Computers

A routing methodology for achieving fault tolerance in direct networks

Abstract

Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role in the performance of such systems. However, the large number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism that keeps the system running even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. To avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.
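
The intermediate-node idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify. It assumes a 2D mesh, node faults only, a fault set known in advance (the static fault model), and minimal fully adaptive routing, so a subpath counts as safe only if every node on any minimal path between its endpoints is healthy. All names (`plan_route`, `minimal_region`, and so on) are hypothetical.

```python
from itertools import product
from typing import List, Optional, Set, Tuple

Node = Tuple[int, int]  # (x, y) coordinates in an assumed 2D mesh


def minimal_region(a: Node, b: Node) -> Set[Node]:
    """Nodes lying on at least one minimal path from a to b in a 2D mesh.

    In a mesh this is the bounding rectangle spanned by a and b.
    """
    xs = range(min(a[0], b[0]), max(a[0], b[0]) + 1)
    ys = range(min(a[1], b[1]), max(a[1], b[1]) + 1)
    return set(product(xs, ys))


def region_is_fault_free(a: Node, b: Node, faulty: Set[Node]) -> bool:
    """True if minimal adaptive routing from a to b can never meet a fault."""
    return minimal_region(a, b).isdisjoint(faulty)


def distance(a: Node, b: Node) -> int:
    """Hop count of a minimal path in the mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan_route(src: Node, dst: Node, faulty: Set[Node],
               width: int, height: int) -> Optional[List[Node]]:
    """Return waypoints [src, dst] or [src, intermediate, dst].

    Returns None when no single intermediate node makes both subpaths
    safe; the abstract notes such cases exist and need extensions.
    """
    if region_is_fault_free(src, dst, faulty):
        return [src, dst]  # no faults in the way: route directly
    candidates = [
        node for node in product(range(width), range(height))
        if node not in faulty and node not in (src, dst)
        and region_is_fault_free(src, node, faulty)
        and region_is_fault_free(node, dst, faulty)
    ]
    if not candidates:
        return None
    # Prefer the shortest total path; break ties by coordinates.
    best = min(candidates,
               key=lambda n: (distance(src, n) + distance(n, dst), n))
    return [src, best, dst]


if __name__ == "__main__":
    print(plan_route((0, 0), (2, 2), {(1, 1)}, 4, 4))
    print(plan_route((0, 0), (3, 0), {(1, 0)}, 4, 4))
```

In this sketch, a fault at (1, 1) between (0, 0) and (2, 2) is avoided through the corner node (0, 2) with no extra hops. A fault at (1, 0) on the straight line between (0, 0) and (3, 0) leaves no usable intermediate node under these assumptions, which is the kind of case the abstract addresses with misrouting, disabling adaptivity, or additional intermediate nodes.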
