Conference: IEEE/ACM International Conference on Automated Software Engineering

How Incidental are the Incidents? Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems


Abstract

Although tremendous effort has been devoted to the quality assurance of online service systems, in practice these systems still experience many incidents (i.e., unplanned interruptions and outages), which can decrease user satisfaction or cause economic loss. To better understand the characteristics of incidents and improve the incident management process, we perform the first large-scale empirical analysis of incidents collected from 18 real-world online service systems at Microsoft. Surprisingly, we find that although a large number of incidents can occur over a short period of time, many of them do not actually matter, i.e., engineers will not fix them with high priority after manually identifying their root cause. We call these incidents incidental incidents. Our qualitative and quantitative analyses show that incidental incidents are significant in terms of both number and cost. Therefore, it is important to prioritize incidents by identifying incidental incidents in advance, so as to optimize incident management effort. In particular, we propose an approach called DeepIP (Deep learning based Incident Prioritization) to prioritize incidents based on a large amount of historical incident data. More specifically, we design an attention-based Convolutional Neural Network (CNN) to learn a prediction model that identifies incidental incidents. We then prioritize all incidents by ranking them according to their predicted probability of being incidental. We evaluate the performance of DeepIP using real-world incident data. The experimental results show that DeepIP effectively prioritizes incidents by identifying incidental incidents and significantly outperforms all the compared approaches. For example, DeepIP achieves an AUC of 0.808, while the best compared approach achieves only 0.624 on average.
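The ranking and evaluation steps described in the abstract can be illustrated with a minimal sketch. This is not the DeepIP implementation: the attention-based CNN is omitted, the incident IDs, scores, and labels are made-up values, and the function names `prioritize` and `auc` are hypothetical. The sketch also assumes that incidents least likely to be incidental should be handled first, a direction the abstract does not state.

```python
from typing import List, Tuple


def prioritize(incidents: List[Tuple[str, float]]) -> List[str]:
    """Return incident IDs ordered by ascending probability of being incidental.

    Assumption: a low probability of being incidental means the incident
    deserves attention first.
    """
    return [incident_id for incident_id, _ in sorted(incidents, key=lambda x: x[1])]


def auc(labels: List[int], scores: List[float]) -> float:
    """Pairwise AUC: the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs: (incident ID, predicted probability of being incidental).
incidents = [("INC-1", 0.91), ("INC-2", 0.12), ("INC-3", 0.55), ("INC-4", 0.30)]
# Hypothetical ground truth in the same order: 1 = incidental, 0 = not incidental.
labels = [1, 0, 0, 1]

ranking = prioritize(incidents)
score = auc(labels, [p for _, p in incidents])
print(ranking)  # ['INC-2', 'INC-4', 'INC-3', 'INC-1']
print(score)    # 0.75
```

In this toy data, three of the four (incidental, non-incidental) pairs are ordered correctly, which gives an AUC of 0.75. An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect one.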
