Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Abstract

Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast computing devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, as they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction between interconnect errors, network congestion, and application characteristics.

Bibliographic details

Source
Annual IEEE/IFIP International Conference on Dependable Systems and Networks | 2018 | 706p | 8 pages
Authors
Mohit Kumar; Saurabh Gupta; Tirthak Patel; Michael Wilder; Weisong Shi; Song Fu; Christian Engelmann; Devesh Tiwari
Original format: PDF
Chinese Library Classification: TP393.08
Keywords
Resilience; Time-frequency analysis; Performance evaluation; Blades; Supercomputers; Integrated circuit interconnections; Bandwidth
