IEEE International Conference on Software Quality, Reliability and Security

DURFEX: A Feature Extraction Technique for Efficient Detection of Duplicate Bug Reports



Abstract

The detection of duplicate bug reports can help reduce the time spent handling field crashes. This is especially important for software companies with a large client base, where multiple customers may submit bug reports caused by the same faults. Several techniques exist for detecting duplicate bug reports; many rely on some form of classification applied to information extracted from stack traces. They classify each report using the functions invoked in the stack trace associated with the bug report. The problem is that typical bug repositories may contain stack traces with tens of thousands of distinct functions, which leads to the curse of dimensionality. In this paper, we propose a feature extraction technique that reduces the feature size while retaining the information that is most critical for classification. The proposed approach starts by abstracting stack traces of function calls into sequences of package names, replacing each function with the package in which it is defined. We then segment these traces into multiple N-grams of variable length and map them to fixed-size sparse feature vectors, which are used to measure the distance between the stack trace of an incoming bug report and the stack traces of a historical set of bug reports. A linear combination of stack trace similarity and non-textual fields such as component and severity is then used to measure the distance between a bug report and a historical set of bug reports. We show the effectiveness of our approach by applying it to the Eclipse bug repository, which contains tens of thousands of bug reports. Our approach outperforms the approach that uses distinct function names, while significantly reducing the processing time.
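The pipeline the abstract describes (package abstraction, variable-length N-gram segmentation, sparse feature vectors, and a linear combination with categorical fields) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function-to-package table, the similarity measure (cosine), and the weights are hypothetical placeholders.

```python
import math
from collections import Counter

# Hypothetical function -> defining-package table (an assumption for
# illustration; in practice this mapping comes from the code base).
PACKAGE_OF = {
    "Workbench.runUI": "org.eclipse.ui",
    "Display.readAndDispatch": "org.eclipse.swt",
    "ArrayList.get": "java.util",
}

def abstract_trace(stack_trace, package_of):
    """Replace each function in a stack trace with its defining package."""
    return [package_of.get(fn, "<unknown>") for fn in stack_trace]

def ngrams(seq, max_n):
    """Segment a sequence into all N-grams of length 1..max_n."""
    return [tuple(seq[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(seq) - n + 1)]

def vectorize(trace, package_of, max_n=2):
    """Map a stack trace to a sparse N-gram frequency vector."""
    return Counter(ngrams(abstract_trace(trace, package_of), max_n))

def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors (Counters)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def report_distance(r1, r2, w_trace=0.7, w_comp=0.2, w_sev=0.1):
    """Distance as 1 minus a linear combination of trace similarity and
    matches on non-textual fields (weights are illustrative)."""
    sim = w_trace * cosine_sim(r1["vec"], r2["vec"])
    sim += w_comp * (r1["component"] == r2["component"])
    sim += w_sev * (r1["severity"] == r2["severity"])
    return 1.0 - sim
```

An incoming report would be vectorized once and compared against the stored vectors of historical reports, ranking candidates by ascending distance; because the vectors are over packages rather than distinct function names, their dimensionality stays small.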

