Conference: 2011 23rd IEEE International Conference on Tools with Artificial Intelligence

Impact of Data Sampling on Stability of Feature Selection for Software Measurement Data

Abstract

Software defect prediction can be treated as a binary classification problem. Practitioners typically use historical software data, including metric and fault data collected during the software development process, to build a classification model, and then use this model to classify new program modules as either fault-prone (fp) or not-fault-prone (nfp). Limited project resources can then be allocated according to the prediction results, for example by assigning more reviews and testing to the modules predicted to be potentially defective. Two challenges often accompany the modeling process: (1) the high dimensionality of software measurement data and (2) skewed or imbalanced distributions between the two types of modules (fp and nfp) in those datasets. To overcome these problems, extensive studies have been dedicated to improving the quality of training data. The commonly used techniques are feature selection and data sampling. Researchers usually focus on evaluating classification performance after the training data is modified. The present study assesses feature selection from a different perspective: the stability of a feature selection method, and in particular the impact of data sampling techniques on that stability when feature selection is applied to the sampled data. Two case studies performed on datasets from two real-world software projects yield several interesting findings.
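
The abstract does not say which sampling technique, feature ranker, or stability measure the study uses. As an illustration only, the sketch below assumes random undersampling of the nfp class, a simple mean-difference filter score, and the average pairwise Jaccard similarity of the selected top-k feature subsets as the stability measure. All function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import random
from itertools import combinations


def make_synthetic(n=200, n_features=6, fp_rate=0.1, seed=1):
    """Build a toy imbalanced dataset (assumption: label 1 = fp, 0 = nfp)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        y = 1 if i < int(n * fp_rate) else 0
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 3.0 * y  # feature 0 is deliberately made informative
        rows.append(row)
        labels.append(y)
    return rows, labels


def undersample(rows, labels, rng):
    """Random undersampling: keep every fp row plus an equal-size random
    subset of nfp rows (assumes fp is the minority class)."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = minority + rng.sample(majority, len(minority))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def top_k_features(rows, labels, k):
    """Score each feature by the absolute difference of its class means
    and return the indices of the k highest-scoring features as a set."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two non-empty feature subsets."""
    return len(a & b) / len(a | b)


def stability(rows, labels, k, runs, seed=0):
    """Average pairwise Jaccard similarity of the top-k subsets selected
    on `runs` independently undersampled copies of the data."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(runs):
        sampled_rows, sampled_labels = undersample(rows, labels, rng)
        subsets.append(top_k_features(sampled_rows, sampled_labels, k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    rows, labels = make_synthetic()
    print("stability of top-2 selection:", stability(rows, labels, k=2, runs=10))
```

A value near 1 means the same features are chosen regardless of which nfp modules the sampler happens to keep; a value near 0 means the selection changes from one sampled dataset to the next.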