Towards a Framework for Monitoring and Analyzing High Performance Computing Environments Using Kubernetes and Prometheus

Abstract

The challenge of monitoring a computational center grows as the center deploys larger and more diverse systems. As system size grows, it becomes harder to discern the problem from the noise. Staff often experience alert fatigue, an occurrence when so many alerts come in that the actual problem is obscured by false alarms or by alarms for issues that are symptoms of the core problem. The National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory (LBNL) has begun to address this issue by ensuring that most alerts are actionable and that multiple alerts for common problems, such as node outages, do not arise. However, more work is needed for these solutions to be extensible to emerging extreme-scale systems. In this paper, we propose a framework for proactively monitoring and managing data center operations, capable of scaling to accommodate the heterogeneity and complexity of next-generation systems. We describe a new architecture for the Operations Monitoring and Notification Infrastructure (OMNI) at NERSC that enables proactive monitoring and management at scale by integrating state-of-the-art technology, such as Kubernetes, Prometheus, Grafana, and other predictive platforms with data from metrics, sensors, and analytics engines. The system will support the operation of the upcoming Perlmutter HPC system, to be delivered in late 2020, as well as NERSC's successive computational system deployments. This comprehensive infrastructure will assist in centrally orchestrating services and deployments, automatically analyzing streaming data, correlating multiple-sourced data, and thresholding alerts to identify core issues from a single view.
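The alert-reduction idea in the abstract (thresholding alerts and avoiding multiple alerts for one node outage) can be illustrated with a minimal Python sketch. This is not the OMNI implementation and does not use Prometheus or Alertmanager APIs; the `Alert` type, the `NodeDown` alert name, the `actionable_alerts` function, and the threshold values are all hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    node: str            # hypothetical node identifier
    name: str            # hypothetical alert name, e.g. "NodeDown"
    value: float = 0.0   # measured value that triggered the alert


# Assumption: a node outage is treated as the root cause of every
# other alert raised on the same node.
ROOT_CAUSE = "NodeDown"


def actionable_alerts(alerts, thresholds):
    """Return a reduced list of alerts.

    1. Drop non-root-cause alerts whose value is below their threshold.
    2. Collapse exact duplicates on the same node.
    3. If a node has a root-cause alert, keep only that alert for the node.
    """
    by_node = defaultdict(list)
    for alert in alerts:
        below = alert.value < thresholds.get(alert.name, 0.0)
        if alert.name != ROOT_CAUSE and below:
            continue  # thresholding: ignore values that are not yet a problem
        if alert not in by_node[alert.node]:
            by_node[alert.node].append(alert)

    result = []
    for node in sorted(by_node):
        node_alerts = by_node[node]
        roots = [a for a in node_alerts if a.name == ROOT_CAUSE]
        # Symptom alerts are suppressed when the node itself is down.
        result.extend(roots[:1] if roots else node_alerts)
    return result


if __name__ == "__main__":
    incoming = [
        Alert("n1", "NodeDown"),
        Alert("n1", "HighTemp", 95.0),
        Alert("n2", "HighTemp", 70.0),
    ]
    for alert in actionable_alerts(incoming, {"HighTemp": 90.0}):
        print(alert)
```

In a real deployment this kind of grouping and suppression would more likely be configured in the alerting layer itself rather than hand-written; the sketch only shows the logic of surfacing the core issue instead of its symptoms.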


Bibliographic details

Source
IEEE SmartWorld Conference | 2019 | 1 v. | 6 pages
Authors
Nitin Sukhija; Elizabeth Bautista
Original format
PDF
Chinese Library Classification
Computing technology, computer technology
Keywords
Monitoring; Data centers; Tools; Market research; Measurement; Organizations; Hardware

