Journal: Mathematical Problems in Engineering: Theory, Methods and Applications

An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data



Abstract

Cross-project defect prediction (CPDP) for projects with limited historical data has attracted much attention. To the best of our knowledge, however, existing approaches usually perform poorly because of low-quality cross-project training data. The objective of this study is to propose an improved method for CPDP that simplifies the training data, labeled TDSelector, which considers both the similarity of each training instance and the number of defects it has (denoted by defects), and to demonstrate the effectiveness of the proposed method. Our work consists of three main steps. First, we constructed TDSelector as a linear weighted function of instances' similarity and defects. Second, we built the basic defect predictor used in our experiments with the Logistic Regression classification algorithm. Third, we analyzed the impact of different combinations of similarity measures and defect normalization methods on prediction performance and then compared our method with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that TDSelector performs, on average, better than both baseline methods, with AUC values increased by up to 10.6% and 4.3%, respectively. That is, including defects is indeed helpful for selecting high-quality training instances for CPDP. In addition, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. An additional experiment also shows that directly selecting instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.
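The selection idea in the abstract, a linear weighted function of similarity and defects, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact formula, so several details here are assumptions: the weight `alpha`, the `1 / (1 + d)` mapping from Euclidean distance to similarity, taking the best similarity over all target instances, and the helper names `score_instances` and `select_top_k`. Linear normalization is interpreted as min-max scaling.

```python
import math


def euclidean_similarity(a, b):
    # Assumption: turn Euclidean distance d into a similarity in (0, 1] via 1 / (1 + d).
    return 1.0 / (1.0 + math.dist(a, b))


def min_max(values):
    # Linear (min-max) normalization to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_instances(candidates, defects, target, alpha=0.5):
    # Hypothetical scoring: alpha * similarity + (1 - alpha) * normalized defect count.
    # Similarity of a candidate is its best similarity to any target-project instance.
    sims = [max(euclidean_similarity(c, t) for t in target) for c in candidates]
    norm_defects = min_max(defects)
    return [alpha * s + (1 - alpha) * d for s, d in zip(sims, norm_defects)]


def select_top_k(candidates, defects, target, k, alpha=0.5):
    # Return the indices of the k highest-scoring candidate training instances.
    scores = score_instances(candidates, defects, target, alpha)
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: three candidate instances with two metrics each, plus defect counts.
    candidates = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    defects = [0, 3, 1]
    target = [[1.0, 1.0]]
    print(select_top_k(candidates, defects, target, k=2))
```

The selected instances would then serve as training data for a classifier such as logistic regression, which the abstract names as the basic predictor. Setting `alpha=1.0` in this sketch reduces it to similarity-only selection, which makes it easy to see what the defect term contributes.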
