Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction


Abstract

Software has become an indispensable part of human life. In the current computing era, many large-scale complex network systems and millions of modern technological devices produce a huge amount of data every second. A relatively large share of these data is imbalanced, and imbalanced data mislead machine learning models. Software Defect Prediction (SDP) is one of the most helpful activities during the testing phase. The cost of finding and fixing defects is estimated at billions of pounds per year. Software defect prediction has emerged to reduce this cost, but it needs fine tuning to reach the expected efficiency. In this chapter, we propose a new model based on a machine learning approach to predict software defects and to identify the key factors that may help software engineers find the most defect-prone parts of a system. The proposed model works as follows. First, highly correlated features are removed and all features are brought to the same scale using feature scaling. Second, the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic (ADASYN) sampling, and a hybrid sampling method are used to balance highly imbalanced datasets. Third, Random Forest Importance and Chi-square algorithms are used to find the factors that have a high effect on software defects. Cross-validation is used to counter the overfitting problem. The Scikit-learn library is used for the machine learning algorithms, and the Pandas library is used for data processing. Matplotlib and PyPlot are used for graphs and data visualization, respectively. The combination of the hybrid sampling method and the Random Forest (RF) algorithm achieved the highest prediction accuracy, about 93.26%.
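The steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the hybrid sampling method, so the sketch assumes it means SMOTE-style oversampling of the minority class combined with random undersampling of the majority class. The synthetic dataset, the 0.9 correlation threshold, min-max scaling, and the helper names `drop_correlated`, `smote_like`, and `hybrid_sample` are likewise assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler


def drop_correlated(df, threshold=0.9):
    """Drop one feature of every pair whose |correlation| exceeds threshold (assumed value)."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


def smote_like(X_min, n_new, k=5, rng=None):
    """SMOTE-style sketch: interpolate between minority samples and their neighbours."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop the self-match
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neigh[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])


def hybrid_sample(X, y, rng=0):
    """Assumed hybrid scheme: oversample class 1 and undersample class 0 to a midpoint size."""
    rng = np.random.default_rng(rng)
    X_min, X_maj = X[y == 1], X[y == 0]
    target = (len(X_min) + len(X_maj)) // 2
    X_new = smote_like(X_min, target - len(X_min), rng=rng)
    keep = rng.choice(len(X_maj), size=target, replace=False)
    X_bal = np.vstack([X_maj[keep], X_min, X_new])
    y_bal = np.concatenate([np.zeros(target, dtype=int), np.ones(target, dtype=int)])
    return X_bal, y_bal


# Synthetic, imbalanced stand-in for a defect dataset (about 10% "defective" rows).
X_raw, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                               n_redundant=4, weights=[0.9, 0.1], random_state=0)
df = pd.DataFrame(X_raw, columns=[f"metric_{i}" for i in range(X_raw.shape[1])])
df = drop_correlated(df)

# Cross-validation: scale and resample inside each fold so the test fold stays untouched.
scores = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in folds.split(df, y):
    scaler = MinMaxScaler().fit(df.values[train])
    X_tr, y_tr = hybrid_sample(scaler.transform(df.values[train]), y[train])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(scaler.transform(df.values[test]), y[test]))
mean_accuracy = float(np.mean(scores))

# Feature ranking: chi-square needs non-negative inputs, which min-max scaling provides.
X_all = MinMaxScaler().fit_transform(df.values)
chi2_scores, _ = chi2(X_all, y)
X_bal, y_bal = hybrid_sample(X_all, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
ranking = pd.DataFrame({"chi2": chi2_scores, "rf_importance": rf.feature_importances_},
                       index=df.columns).sort_values("rf_importance", ascending=False)

print(f"mean cross-validated accuracy: {mean_accuracy:.3f}")
print(ranking.head())
```

Resampling inside each fold, rather than once before splitting, keeps synthetic minority samples out of the held-out data. Accuracy alone can also look high on imbalanced data, so in practice recall or F1 on the defective class would be worth reporting alongside it.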


Bibliographic details

Source
International conference on electrical, control computer engineering | 2019 | pp. 541-553 | 13 pages
Authors
Md Anwar Hossen; Md. Shariful Islam; Nurhafizah Abu Talip Yusof; Md. Sakib Rahman; Fatema Siddika; Mostafijur Rahman; Sabira Khatun; Mohamad Shaiful Abdul Karim; S. M. Hasan Mahmud;
Original format: PDF
Keywords
Software defect prediction; Machine learning; Imbalanced dataset; Chi square; Random forest importance;


Similar literature


1. Hybrid Approach for Software Defect Prediction Using Machine Learning with Optimization Technique [J]. C. Manjula, Lilly Florence. International Journal of Information Technology. 2018, Issue 1
2. Spatial prediction of landslides using a hybrid machine learning approach based on Random Subspace and Classification and Regression Trees [J]. Binh Thai Pham, Prakash Indra, Dieu Tien Bui. Geomorphology. 2018, Issue FEBa15
3. Software Defect Prediction Based on Non-Linear Manifold Learning and Hybrid Deep Learning Techniques [J]. Kun Zhu, Nana Zhang, Qing Zhang. Computers, Materials & Continua. 2020, Issue 2
4. A random forest based machine learning approach for mild steel defect diagnosis [C]. S. V. Patel, Veena N. Jokhakar. IEEE International Conference on Computational Intelligence and Computing Research. 2016
5. Prediction of Venous Thromboembolism Using a Hybrid Semantic Based and Machine Learning Approach [D]. Sabra, Susan. 2018
6. Original research: Prediction of caregiver burden in amyotrophic lateral sclerosis: a machine learning approach using random forests applied to a cohort study [O]. Anna Markella Antoniadi, Miriam Galvin, Mark Heverin. 2020
7. Software defect prediction using K‐PCA and various kernel‐based extreme learning machine: an empirical study [O]. Sushant Kumar Pandey, Deevashwer Rathee, Anil Kumar Tripathi. 2020
