Conference: IEEE SmartWorld Conference

Towards a Framework for Monitoring and Analyzing High Performance Computing Environments Using Kubernetes and Prometheus

Abstract

The challenge of monitoring a computational center grows as the center deploys larger and more diverse systems. As system size grows, it becomes harder to discern the problem from the noise. Staff often experience alert fatigue, which occurs when so many alerts arrive that the actual problem is obscured by false alarms or by alarms for issues that are merely symptoms of the core problem. The National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory (LBNL) has begun to address this issue by ensuring that most alerts are actionable and that multiple alerts for common problems, such as node outages, do not arise. However, more work is needed for these solutions to be extensible to emerging extreme-scale systems. In this paper, we propose a framework for proactively monitoring and managing data center operations, capable of scaling to accommodate the heterogeneity and complexity of next-generation systems. We describe a new architecture for the Operations Monitoring and Notification Infrastructure (OMNI) at NERSC that enables proactive monitoring and management at scale by integrating state-of-the-art technologies, such as Kubernetes, Prometheus, Grafana, and other predictive platforms, with data from metrics, sensors, and analytics engines. The system will support the operation of the upcoming Perlmutter HPC system, to be delivered in late 2020, as well as NERSC's subsequent computational system deployments. This comprehensive infrastructure will assist in centrally orchestrating services and deployments, automatically analyzing streaming data, correlating multiple-sourced data, and thresholding alerts to identify core issues from a single view.
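
The abstract's goal of avoiding multiple alerts for a single problem, such as a node outage, can be illustrated with a minimal sketch. This is not the OMNI implementation, which the abstract does not describe in detail; the `Alert` type, the `suppress_symptoms` function, and the alert and node names below are all hypothetical. In a Prometheus-based stack this kind of suppression would more typically be expressed as Alertmanager inhibition rules rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record: the affected node and the alert name."""
    node: str
    name: str


def suppress_symptoms(alerts, root_cause="NodeDown"):
    """Keep one root-cause alert per node and drop its symptom alerts.

    Assumption for illustration: any other alert on a node that also has
    a `root_cause` alert is treated as a symptom of that outage.
    """
    down_nodes = {a.node for a in alerts if a.name == root_cause}
    kept = []
    seen = set()
    for alert in alerts:
        if alert.node in down_nodes and alert.name != root_cause:
            continue  # symptom of an outage that is already reported
        if alert in seen:
            continue  # exact duplicate of an alert already kept
        seen.add(alert)
        kept.append(alert)
    return kept


# Made-up input: one node outage with a symptom and a duplicate,
# plus an unrelated alert on another node.
alerts = [
    Alert("nid0001", "NodeDown"),
    Alert("nid0001", "FilesystemUnreachable"),
    Alert("nid0001", "NodeDown"),
    Alert("nid0002", "HighTemperature"),
]
actionable = suppress_symptoms(alerts)
for alert in actionable:
    print(alert.node, alert.name)
```

Here four incoming alerts reduce to two actionable ones: the outage itself and the unrelated temperature alert. A real deployment would need a richer notion of which alerts are symptoms of which causes than the single `root_cause` name assumed here.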