Journal: Operating Systems Review

Triage: Diagnosing Production Run Failures at the User's Site

Abstract

Diagnosing production-run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e., diagnosis at the development site with the programmers present. This is insufficient for production-run failures because: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production-run failure; and (4) privacy concerns limit the release of information (e.g., core dumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the nature of the failure, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment, repeatedly replay the moment of failure, and dynamically analyze the occurring failure using different diagnosis techniques. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework that enables the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure-related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open-source applications, including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.
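The abstract names delta analysis but does not say how it works. As a rough illustration only, the sketch below assumes one plausible reading: compare the execution trace of the failing replay with the most similar non-failing trace and report where they diverge. The function names, the trace format (lists of code-block labels), and the sample data are all hypothetical and are not taken from Triage.

```python
from difflib import SequenceMatcher


def closest_passing_trace(failing, passing_traces):
    """Return the passing trace most similar to the failing one (hypothetical helper)."""
    return max(
        passing_traces,
        key=lambda trace: SequenceMatcher(None, failing, trace).ratio(),
    )


def trace_delta(failing, passing):
    """List (op, passing_slice, failing_slice) for each point where the traces diverge."""
    matcher = SequenceMatcher(None, passing, failing)
    return [
        (tag, passing[i1:i2], failing[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up traces: each entry stands for one executed code block.
failing_trace = ["parse", "alloc", "copy_unchecked", "crash"]
passing_traces = [
    ["parse", "alloc", "check_len", "copy", "reply"],
    ["parse", "reject"],
]

nearest = closest_passing_trace(failing_trace, passing_traces)
delta = trace_delta(failing_trace, nearest)
print(delta)
```

The blocks that appear only on the failing side of the delta are candidates for failure-related code. A real system would need to generate the passing runs itself, for example by replaying from a checkpoint with altered inputs, which this sketch does not attempt.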