Triage: Diagnosing Production Run Failures at the User's Site

Joseph Tucek; Shan Lu; Chengdu Huang; Spiros Xanthos; Yuanyuan Zhou

Abstract

Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e. development site diagnosis with the programmers present. This is insufficient for production-run failures as: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production run failure; and (4) privacy concerns limit the release of information (e.g. coredumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment and repeatedly replay the moment of failure, and dynamically—using different diagnosis techniques—analyze an occurring failure. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead to under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.
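The replay-driven diagnosis loop described in the abstract can be illustrated with a toy sketch. This is not Triage's implementation: the `Checkpoint` class, the `replay`, `is_deterministic`, `triggering_inputs`, and `diagnose` functions, and the `toy_server` program are all hypothetical names invented here. The sketch assumes the program state can be snapshotted in memory and that all inputs after the snapshot were logged. It shows only the general idea of re-running the moment of failure several times, each time asking a different question.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical snapshot: program state plus the inputs logged after it."""
    state: dict
    logged_inputs: List[str]


def replay(cp: Checkpoint, program: Callable[[dict, str], None],
           inputs: Optional[List[str]] = None) -> Optional[Exception]:
    """Re-run `program` on a copy of the checkpointed state.

    Returns the exception that ended the run, or None if the run succeeded.
    """
    state = copy.deepcopy(cp.state)  # never mutate the checkpoint itself
    try:
        for item in (cp.logged_inputs if inputs is None else inputs):
            program(state, item)
    except Exception as exc:
        return exc
    return None


def is_deterministic(cp: Checkpoint, program, runs: int = 3) -> bool:
    """Step 1 (assumed): does the failure recur on every plain replay?"""
    return all(replay(cp, program) is not None for _ in range(runs))


def triggering_inputs(cp: Checkpoint, program) -> List[str]:
    """Step 2 (assumed): which logged inputs are needed for the failure?

    Drops one input at a time; if the failure vanishes, that input matters.
    """
    needed = []
    for i, item in enumerate(cp.logged_inputs):
        trimmed = cp.logged_inputs[:i] + cp.logged_inputs[i + 1:]
        if replay(cp, program, trimmed) is None:
            needed.append(item)
    return needed


def diagnose(cp: Checkpoint, program) -> dict:
    """Run the steps in order and collect a small report."""
    failure = replay(cp, program)
    report = {"failure": type(failure).__name__ if failure else None}
    if failure is None:
        return report
    report["deterministic"] = is_deterministic(cp, program)
    report["triggering_inputs"] = triggering_inputs(cp, program)
    return report


def toy_server(state: dict, request: str) -> None:
    """Toy program with a bug: 'append' after 'reset' dereferences None."""
    if request == "reset":
        state["buffer"] = None
    elif request == "append":
        state["buffer"].append(1)


cp = Checkpoint(state={"buffer": []},
                logged_inputs=["append", "reset", "append"])
report = diagnose(cp, toy_server)
print(report)
```

Each step is an ordinary function over the same checkpoint, so further techniques could be added as extra steps without changing the replay mechanism. That is the sense in which the abstract calls the protocol extensible.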

Bibliographic details

Source: Operating Systems Review, 2007, No. 6, pp. 131-144 (14 pages)
Authors: Joseph Tucek; Shan Lu; Chengdu Huang; Spiros Xanthos; Yuanyuan Zhou
Author affiliation: Department of Computer Science, University of Illinois at Urbana-Champaign
Original format: PDF
Language: English
Chinese Library Classification: automation and computer technology; computing technology and computer technology
Keywords: experimentation; reliability
