Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Conference: Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Abstract

Today's High Performance Computing (HPC) systems can deliver performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.
