Current research on software defect prediction (SDP) focuses mainly on two aspects: the sources from which historical data are acquired and the prediction methods. However, the historical software defect data obtained are usually class-imbalanced, so traditional prediction methods misclassify a large share of the defect data. To address this problem, we propose using an imbalanced classification method based on statistical sampling for software defect prediction. By empirically comparing the prediction performance of 12 combinations of existing sampling and classification algorithms, we find that SP-RF (SpreadSubsampling combined with random forest) gives the best overall performance but has a relatively high false positive rate (FPR). To further improve prediction performance, and to address the shortcomings of the original SP-RF method, which introduces considerable noise into the original data and loses information from it, we propose an improved SP-RF-based adaptive random forest algorithm with inner-balanced sampling (IBSBA-RF). Experiments show that the IBSBA-RF algorithm significantly reduces the FPR of the prediction results and further increases their AUC and Balance values.
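To make the sampling step and the reported measures concrete, the following is a minimal sketch and not the authors' implementation. It assumes that spread subsampling means randomly discarding instances until no class is more than `max_spread` times the size of the rarest class, and that Balance is the normalized distance from the ideal point (FPR = 0, recall = 1) commonly used in the defect prediction literature; the abstract defines neither. The function names `spread_subsample` and `fpr_recall_balance` are hypothetical, and the inner-balanced sampling of IBSBA-RF is not reproduced here.

```python
import math
import random
from collections import defaultdict


def spread_subsample(rows, labels, max_spread=1.0, seed=0):
    """Randomly undersample so that no class exceeds max_spread times
    the size of the rarest class (assumed reading of spread subsampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Largest class size allowed after subsampling.
    cap = int(max_spread * min(len(ix) for ix in by_class.values()))
    kept = []
    for indices in by_class.values():
        if len(indices) > cap:
            indices = rng.sample(indices, cap)
        kept.extend(indices)
    kept.sort()  # preserve the original instance order
    return [rows[i] for i in kept], [labels[i] for i in kept]


def fpr_recall_balance(y_true, y_pred, positive=1):
    """Return (FPR, recall, Balance) for binary predictions.

    Balance is assumed to be 1 - sqrt((0 - FPR)^2 + (1 - recall)^2) / sqrt(2).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against empty classes to avoid division by zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    balance = 1 - math.sqrt(fpr ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return fpr, recall, balance


if __name__ == "__main__":
    # Toy imbalanced data: 90 non-defective (0) and 10 defective (1) modules.
    labels = [0] * 90 + [1] * 10
    rows = list(range(100))
    _, balanced_labels = spread_subsample(rows, labels, max_spread=1.0)
    print(balanced_labels.count(0), balanced_labels.count(1))
    print(fpr_recall_balance([1, 1, 0, 0], [1, 0, 1, 0]))
```

In an SP-RF-style pipeline, the subsampled data would then be used to train a random forest. Because subsampling discards majority-class instances, it loses information, which is the shortcoming that motivates IBSBA-RF.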