Venue: IEEE Annual Computer Software and Applications Conference

Software Fault Proneness Prediction with Group Lasso Regression: On Factors that Affect Classification Performance

Abstract

Machine learning algorithms have been used extensively for software fault proneness prediction. This paper presents the first application of Group Lasso Regression (G-Lasso) to software fault proneness classification and compares its performance to that of six widely used machine learning algorithms. Furthermore, we explore the effects of two factors on prediction performance: imbalance treatment using the Synthetic Minority Over-sampling Technique (SMOTE), and the choice of datasets used to build the prediction models. Our experimental results are based on 22 datasets extracted from open source projects. The main findings include: (1) G-Lasso is robust to imbalanced data and significantly outperforms the other machine learning algorithms with respect to Recall and G-Score, i.e., the harmonic mean of Recall and (1 - False Positive Rate). (2) Even though SMOTE improved the performance of all learners, it did not have a statistically significant effect on G-Lasso's Recall and G-Score. Random Forest was in the top-performing group of learners for all performance metrics, while Naive Bayes performed the worst of all learners. (3) When the same change metrics were used as features, the choice of dataset had no effect on the performance of most learners, including G-Lasso. Naive Bayes was the most affected, especially when balanced datasets were used.
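
As an illustration of the G-Score defined in the abstract (the harmonic mean of Recall and 1 - False Positive Rate), the following is a minimal sketch in Python. The function name `g_score`, its confusion-matrix arguments, and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
def g_score(tp: int, fn: int, fp: int, tn: int) -> float:
    """Harmonic mean of Recall and (1 - False Positive Rate).

    tp, fn, fp, tn are confusion-matrix counts for the fault-prone
    (positive) class. All names here are illustrative assumptions.
    """
    # Recall: fraction of fault-prone units that were correctly flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # False Positive Rate: fraction of fault-free units wrongly flagged.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    one_minus_fpr = 1.0 - fpr
    denom = recall + one_minus_fpr
    # The harmonic mean is taken as 0 when both terms are 0.
    return 2.0 * recall * one_minus_fpr / denom if denom else 0.0


# Hypothetical imbalanced example: 10 fault-prone and 90 fault-free units.
score = g_score(tp=8, fn=2, fp=10, tn=80)
print(round(score, 4))
```

Because the harmonic mean is pulled toward the smaller of its two terms, a learner that labels nearly everything fault-free (low Recall) or nearly everything fault-prone (high False Positive Rate) receives a low G-Score, which is why the metric is informative on imbalanced data.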
