
An Empirical Investigation of Incident Triage for Online Service Systems


Abstract

Online service systems have become increasingly popular. During the operation of an online service system, incidents (unplanned interruptions or outages of the service) are inevitable. As an initial step of incident management, it is important to be able to automatically assign an incident report to a suitable team. We call this step incident triage; it can significantly affect the efficiency and accuracy of overall incident management. To better understand incident-triage practice in industry, we perform an empirical study of incident triage on 20 large-scale online service systems at Microsoft. We find that incorrect assignment of incident reports occurs frequently and incurs unnecessary cost, especially for high-severity incidents. For example, about 4.11% to 91.58% of incident reports are reassigned at least once, and the average increase in incident-triage time caused by the reassignments is up to 10.16X. Considering the similarity between bug triage (automatically assigning bug reports to software developers) and incident triage, we then explore the applicability of typical bug-triage techniques to incident triage for online service systems. The results demonstrate that these bug-triage techniques can correctly assign incident reports to a certain extent but still need further improvement, especially for incident reports that are assigned incorrectly the first time. Based on the empirical study, we further discuss possible ways to improve the accuracy of incident triage. To the best of our knowledge, we are the first to investigate incident triage in industrial practice. Our results can help both practitioners and researchers develop methods and tools to improve current incident-triage practice for online service systems.
