Journal: Information and Software Technology

Efficient feature extraction model for validation performance improvement of duplicate bug report detection in software bug triage systems

Abstract

Context: Software bug triage systems hold many duplicate bug reports in their semi-structured repositories, and duplicate bug report detection (DBRD) is a significant problem in these systems.

Objective: DBRD raises several challenges: extracting features efficiently so that similarities between bug reports can be calculated accurately, building a high-performance duplicate detector model, and handling continuous real-time queries. Feature extraction is a technique that converts unstructured data into structured data. The main objective of this study is to improve the validation performance of DBRD using a feature extraction model.

Method: This research focuses on feature extraction to build a new general model containing all types of features. It also introduces a new feature extractor that captures a new viewpoint of similarity between texts. The proposed method introduces new textual features based on the aggregation of the term frequency and inverse document frequency of the text fields of bug reports, in uni-gram and bi-gram forms. In addition, a new hybrid measurement metric is proposed for detecting efficient features; it is used to evaluate the efficiency of all features, including the proposed ones.

Results: The validation performance of DBRD was compared for the proposed features and state-of-the-art features. To show the effectiveness of our model, we applied it and approaches from related studies to DBRD on the Android, Eclipse, Mozilla, and Open Office datasets and compared the results. The comparisons showed that our proposed model achieved (i) approximately 2% improvement in accuracy and precision and more than 4.5% and 5.9% improvement in recall and F1-measure, respectively, when applying the linear regression (LR) and decision tree (DT) classifiers, and (ii) a performance of 91%-99% (average about 97%) on the four metrics when applying the DT classifier, which was the best classifier.

Conclusion: Our proposed features improved the validation performance of DBRD with regard to runtime performance. Pre-processing methods (primarily stemming) could improve the validation performance of DBRD only slightly (up to 0.3%), whereas rule-based machine learning algorithms are more useful for the DBRD problem. The results showed that our proposed model is more effective both for the datasets on which state-of-the-art approaches were effective (i.e., Mozilla Firefox) and for those on which state-of-the-art approaches were less effective (i.e., Android). The results also showed that combining all types of features could improve the validation performance of DBRD even for the LR classifier, which has lower validation performance but can be implemented easily in software bug triage systems. Without using the longest common subsequence (LCS) feature, which is effective but time-consuming, our proposed features could match the effectiveness of LCS with lower time complexity and runtime overhead. In addition, a statistical analysis shows that the results are reliable and can be generalized to other datasets or similar classifiers.
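The abstract does not give the exact formula for the aggregated term frequency and inverse document frequency features, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes whitespace tokenization, a smoothed IDF, and a feature defined as the summed TF-IDF weight of the n-grams two reports share, computed once for uni-grams and once for bi-grams. All function names (`ngrams`, `idf_table`, `shared_tfidf_sum`, `pair_features`) and the sample reports are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_table(corpus, n):
    """Smoothed IDF per n-gram: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(ngrams(doc, n)))  # count each n-gram once per document
    total = len(corpus)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def shared_tfidf_sum(text_a, text_b, idf, n):
    """Sum TF * IDF over the n-grams that occur in both texts."""
    tf_a = Counter(ngrams(text_a, n))
    tf_b = Counter(ngrams(text_b, n))
    shared = tf_a.keys() & tf_b.keys()
    return sum((tf_a[g] + tf_b[g]) * idf.get(g, 0.0) for g in shared)


def pair_features(text_a, text_b, corpus):
    """Return one aggregated feature per n-gram size (1 = uni-gram, 2 = bi-gram)."""
    return {
        f"shared_tfidf_{n}gram": shared_tfidf_sum(text_a, text_b, idf_table(corpus, n), n)
        for n in (1, 2)
    }


# Hypothetical bug report summaries, used only to exercise the sketch.
reports = [
    "app crashes when opening settings",
    "application crashes on opening settings page",
    "battery drains fast in standby",
]
likely_duplicate = pair_features(reports[0], reports[1], reports)
unrelated = pair_features(reports[0], reports[2], reports)
```

Under these assumptions, the pair that describes the same crash receives positive uni-gram and bi-gram scores, while the unrelated pair scores zero on both. Each feature costs time roughly linear in the length of the two texts, in contrast to the quadratic dynamic program normally used for LCS, which is consistent with the abstract's remark about lower time complexity. Feature vectors of this kind would then be passed to a classifier such as the DT or LR models named in the abstract.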
