
A New Fault-Tolerant Routing Methodology for KNS Topologies


Abstract

Exascale computing systems are being built with thousands of nodes. A key component of these systems is the interconnection network. The high number of components significantly increases the probability of failure. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid KNS family, which provides high performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a reasonably large number of faults without disabling any healthy node. To tolerate network failures, the methodology uses a simple mechanism: for some source-destination pairs, and only if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network), which allows faults to be avoided. The evaluation results show that the methodology tolerates a large number of faults. Furthermore, the methodology offers graceful performance degradation. For instance, performance degrades by only 1% for a 2D network with 1024 nodes and 1% faulty links.
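The intermediate-node mechanism described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a plain 2D mesh with dimension-order (X then Y) routing as a stand-in for the KNS topology, a single intermediate node rather than a set, and an exhaustive search for that node. The function names `xy_path`, `path_is_healthy`, and `route` are hypothetical.

```python
from itertools import product


def xy_path(src, dst):
    """Dimension-order path on a 2D mesh: correct X first, then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def path_is_healthy(path, faulty_links):
    """True if no hop of the path crosses a faulty (undirected) link."""
    return all(frozenset(hop) not in faulty_links
               for hop in zip(path, path[1:]))


def route(src, dst, size, faulty_links):
    """Return a fault-free path on a size x size mesh, or None.

    The direct dimension-order path is used whenever it is healthy.
    Only if it is broken do we look for an intermediate node such that
    both sub-paths (src -> mid and mid -> dst) are healthy.
    Assumption: exhaustive search, shortest detour wins.
    """
    direct = xy_path(src, dst)
    if path_is_healthy(direct, faulty_links):
        return direct
    best = None
    for mid in product(range(size), repeat=2):
        if mid in (src, dst):
            continue
        first, second = xy_path(src, mid), xy_path(mid, dst)
        if (path_is_healthy(first, faulty_links)
                and path_is_healthy(second, faulty_links)):
            candidate = first + second[1:]  # do not repeat the intermediate node
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


# Usage: one broken link on the direct path from (0, 0) to (3, 0).
faults = {frozenset({(1, 0), (2, 0)})}
print(route((0, 0), (3, 0), 4, faults))
```

In this sketch, source-destination pairs whose direct path is healthy are routed exactly as in the fault-free network, which mirrors the abstract's "only if necessary" condition. A real implementation would also have to address deadlock freedom, which the sketch ignores.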

