Conference: Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System



Abstract

Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.
