In the domain of software quality classification, data mining techniques are used to construct models (learners) for identifying software modules that are most likely to be fault-prone. The performance of these models, however, can be negatively affected by class imbalance and noise. Data sampling techniques have been proposed to alleviate the problem of class imbalance, but the impact of data quality on these techniques has not been adequately addressed. We examine the combined effects of noise and imbalance on classification performance when seven commonly-used sampling techniques are applied to software quality measurement data. Our results show that some sampling techniques are more robust in the presence of noise than others. Further, sampling techniques are affected by noise differently given different levels of imbalance.
展开▼