IEEE International Conference on Cluster Computing

Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components

Abstract

High performance computing clusters are designed to host highly parallelized applications, often with thousands of nodes allocated to a single job. These jobs, especially those that require a high level of synchronous communication, can be greatly affected by a single poorly performing, or even merely sub-standard, component. These components, often referred to as nodes, typically comprise CPUs, accelerator processors, memory, a communication bus, and so on. Consequently, it is important to identify and eliminate sub-standard nodes before a job is scheduled onto them. In this paper we describe the process used to measure node performance and the methodology used to quantify poorly performing nodes or to classify suspect nodes into groups, or clusters, that can later be used to identify future performance issues. This process is more involved than simply running a scientific calculation across all the nodes, finding one that was “slow”, and labeling it a bad node. At Los Alamos, this methodology has been used successfully to find problem nodes and has helped characterize the components of other clusters to aid in the proactive elimination of potential problems.
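
The abstract does not name the specific algorithms used, so the following is only an illustrative sketch of the general idea of grouping nodes by measured performance rather than flagging a single "slow" result. It assumes k-means clustering over two standardized metrics; the node names, metric choices, measurement values, and the `standardize` and `kmeans` helpers are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each metric to zero mean and unit variance so no metric dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against a constant column
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Add the point farthest from every centroid chosen so far.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(mean(axis) for axis in zip(*members))
    return labels, centroids


# Hypothetical per-node measurements: (benchmark runtime in s, memory bandwidth in GB/s).
measurements = {
    "node001": (10.1, 98.0),
    "node002": (10.0, 99.5),
    "node003": (10.2, 98.7),
    "node004": (13.9, 71.2),
    "node005": (9.9, 99.1),
    "node006": (14.2, 70.4),
}

names = list(measurements)
labels, _ = kmeans(standardize(list(measurements.values())), k=2)

# Assumption: the largest cluster represents typical nodes; the rest are suspect.
typical = Counter(labels).most_common(1)[0][0]
suspects = [name for name, lab in zip(names, labels) if lab != typical]
print(suspects)
```

Treating the largest cluster as "typical" is a simplifying assumption. In practice, suspect groups would presumably be inspected further before any node is removed from scheduling.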
