Conference: International conference on electrical, control computer engineering
Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction

Abstract

Software has become an indispensable part of human life. In the current computing era, many large-scale complex network systems and millions of modern devices produce huge amounts of data every second. A relatively large share of this data is imbalanced, and imbalanced data misleads machine learning models. Software Defect Prediction (SDP) is one of the most helpful activities during the testing phase. The estimated cost of finding and fixing defects runs to billions of pounds per year. Software defect prediction has emerged to reduce this cost, but it needs fine-tuning to reach the expected efficiency. In this chapter, we propose a new machine-learning-based model that predicts software defects and identifies the key factors that may help software engineers locate the most defect-prone parts of a system. The proposed model works as follows. First, highly correlated features are removed and all features are brought to the same scale using feature scaling. Second, the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic (ADASYN) sampling, and a hybrid sampling method are used to balance highly imbalanced datasets. Third, Random Forest importance and the chi-square test are used to find the factors that have a strong effect on software defects. Cross-validation is used to mitigate overfitting. The Scikit-learn library is used for the machine learning algorithms and the Pandas library for data processing. Matplotlib and PyPlot are used for graphs and data visualization, respectively. The combination of hybrid sampling and the Random Forest (RF) algorithm achieved the highest prediction accuracy, about 93.26%.
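The first two steps of the abstract (dropping highly correlated features, scaling, and then oversampling the minority class) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the 0.9 correlation threshold, the toy data, and the helper names `drop_correlated`, `min_max_scale`, and `smote_like` are assumptions made for illustration. `smote_like` shows only the core SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. ADASYN, the hybrid sampling method, feature ranking, and the Random Forest classifier are not reproduced here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(rows, threshold=0.9):
    """Return the column indices to keep.

    A column is dropped when its absolute correlation with any
    already-kept column exceeds the (assumed) threshold.
    """
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (simplified SMOTE-style step).

    Each point lies on the segment between a random minority sample and
    one of its k nearest minority neighbours. Needs at least 2 samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): columns are (size, size doubled, complexity);
# the second column is a perfect linear copy of the first.
X = [
    [10, 20, 1], [12, 24, 3], [15, 30, 2], [11, 22, 5],
    [14, 28, 4], [13, 26, 6], [40, 80, 9], [42, 84, 7],
]
y = [0, 0, 0, 0, 0, 0, 1, 1]  # 1 = defective (minority class)

# Step 1: remove redundant features, then bring the rest to one scale.
kept = drop_correlated(X, threshold=0.9)
X_scaled = min_max_scale([[row[j] for j in kept] for row in X])

# Step 2: oversample the minority class until both classes are equal in size.
minority = [row for row, label in zip(X_scaled, y) if label == 1]
n_new = y.count(0) - y.count(1)
synthetic = smote_like(minority, n_new)
X_bal = X_scaled + synthetic
y_bal = y + [1] * n_new
```

In practice a library implementation such as `imblearn.over_sampling.SMOTE` would normally replace `smote_like`. Oversampling should be applied only to the training portion of each cross-validation fold, so that synthetic points derived from the validation data do not inflate the reported accuracy.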