Network Fault Tolerance in Open MPI


Abstract

High Performance Computing (HPC) systems are rapidly growing in size and complexity. As a result, transient and persistent network failures can occur on the time scale of application run times, reducing the productive utilization of these systems. The ubiquitous network protocol used to deal with such failures is TCP/IP; however, available implementations of this protocol provide unacceptable performance for HPC system users and do not deliver the high-bandwidth, low-latency communications of modern interconnects. This paper describes methods used to protect against several kinds of network error, such as dropped packets, corrupt packets, and loss of network interfaces, while maintaining high-performance communications. Micro-benchmark experiments using vendor-supplied TCP/IP and O/S bypass low-level communications stacks over InfiniBand and Myrinet are used to demonstrate the high-performance characteristics of our protocol. The NAS Parallel Benchmarks are used to demonstrate the scalability and the minimal performance impact of this protocol. Communication-level micro-benchmarks show that providing higher data reliability decreases bandwidth by up to 30% relative to unprotected communications, yet still gives a factor-of-four performance improvement over TCP/IP running over InfiniBand DDR. In addition, application-level benchmarks (communication/computation) show virtually no impact of the data reliability protocol on overall run time.
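The abstract names the error classes but not the mechanism, so the following is only a generic illustration of how dropped and corrupt packets are commonly detected and recovered from: a sequence number plus a CRC32 checksum per frame, with retransmission on failure. It is not the paper's protocol and not Open MPI code; every name here (`make_frame`, `check_frame`, `send_reliably`, the `channel` callable) is hypothetical, and loss of a network interface is not modelled.

```python
import struct
import zlib

# Hypothetical frame header: 32-bit sequence number, 32-bit CRC32 of the payload.
HEADER = struct.Struct("!II")


def make_frame(seq: int, payload: bytes) -> bytes:
    """Prefix the payload with its sequence number and checksum."""
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def check_frame(frame: bytes, expected_seq: int):
    """Return the payload if the frame is intact and in order, else None."""
    if len(frame) < HEADER.size:
        return None
    seq, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if seq != expected_seq or zlib.crc32(payload) != crc:
        return None
    return payload


def send_reliably(payloads, channel, max_retries=3):
    """Send each payload through `channel`, retransmitting on drop or corruption.

    `channel(frame)` stands in for the network: it returns the frame as
    received, or None if the frame was dropped (an assumption of this sketch).
    """
    delivered = []
    for seq, payload in enumerate(payloads):
        for _attempt in range(max_retries + 1):
            received = channel(make_frame(seq, payload))
            data = None if received is None else check_frame(received, seq)
            if data is not None:
                delivered.append(data)
                break
        else:
            raise ConnectionError(f"frame {seq} not delivered after {max_retries} retries")
    return delivered
```

A real high-performance implementation would keep many frames in flight and acknowledge them asynchronously; the stop-and-wait loop above is chosen only to keep the sketch short.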
