Understanding and Handling Alert Storm for Online Service Systems

Abstract

Alerts are a key data source in the monitoring systems of online service systems: they record anomalies in service components and report them to engineers. In general, a service failure tends to be accompanied by a large number of alerts, which is called an alert storm. However, an alert storm makes failure diagnosis very challenging, because it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms in practice, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on the findings of the study, we propose a novel approach to handling alert storms. Specifically, the approach includes alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (larger than 0.9). In addition, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers representative alerts accurately. We have successfully applied our approach to the service maintenance of a large commercial bank (China EverBright Bank), and we also share our success stories and lessons learned in industry.

CCS CONCEPTS • Software and its engineering $\rightarrow$ Maintaining software.
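To make the detection task and its F1-score evaluation concrete, the following is a minimal sketch only. The abstract does not specify the detection algorithm, so this is not the authors' method: the fixed one-minute window, the count threshold of 100, and the names `detect_storm_windows` and `f1_score` are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def detect_storm_windows(timestamps, window=timedelta(minutes=1), threshold=100):
    """Return start times of windows whose alert count exceeds `threshold`.

    Assumption: a fixed threshold on per-window counts stands in for
    whatever detector the paper actually uses.
    """
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for ts in timestamps:
        # Bucket each alert into the fixed-size window that contains it.
        counts[(ts - epoch) // window] += 1
    return sorted(epoch + b * window for b, c in counts.items() if c > threshold)


def f1_score(predicted, actual):
    """F1-score of predicted storm windows against labeled storm windows."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

A fixed threshold is the simplest possible baseline; in practice the normal alert volume varies over time, which is one reason an accurate detector is harder to build than this sketch suggests.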
