Published in: IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume

Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System



Abstract

Large-scale Internet services run on a fleet of distributed servers, and the continuous availability of the hardware is key to the robustness of the services. Unplanned reboots disrupt the services running on the hardware and lower the fleet availability. Server reboots are also important signals that could indicate underlying issues such as memory leaks from the services, catastrophic hardware failures, and network or power disruptions at the datacenters. In this paper, we present an at-scale, near-realtime reboot monitoring framework built with multiple state-of-the-art data infrastructures, as well as machine learning-based anomaly detection and automated root cause analysis across hundreds of server attribute combinations. We observed that 1% of the reboots in our hardware fleet were associated with kernel panics and out-of-memory events, and these reboots exhibit strong locality temporally and across services.
