Journal: Expert Systems with Applications

Determining the optimal re-sampling strategy for a classification model with imbalanced data using design of experiments and response surface methodologies

Abstract

Imbalanced data are common in many machine learning applications. In an imbalanced data set, the number of instances in at least one class is significantly higher or lower than that in the other classes. Consequently, when a classification model is developed from imbalanced data, most classifiers are trained on an unequal number of instances per class and thus fail to construct an effective model. Balancing the sample sizes of the classes with a re-sampling strategy is a conventional means of improving the effectiveness of a classification model for imbalanced data. Numerous studies have attempted to determine the appropriate re-sampling proportion for each class by trial and error (Barandela, Vadovinos, Sanchez, & Ferri, 2004; He, Han, & Wang, 2005; Japkowicz, 2000; McCarthy, Zabar, & Weiss, 2005), but finding the optimal strategy for each class may be infeasible with such a method. Therefore, this work proposes a novel analytical procedure that determines the optimal re-sampling strategy based on design of experiments (DOE) and response surface methodologies (RSM). The proposed procedure, S-RSM, can be used with any classifier; the C4.5 algorithm is adopted for illustration. Classification results are evaluated using the area under the receiver operating characteristic curve (AUC) as the performance measure. Desirable features of the AUC index include independence of the decision threshold and invariance to a priori class probabilities. Furthermore, results on five real-world data sets demonstrate that a classification model trained on data obtained from S-RSM achieves a higher AUC score than one trained on data obtained using an oversampling or undersampling approach.
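
The two AUC properties named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name auc, the scores, and the labels are hypothetical. It computes AUC as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, so no decision threshold is involved. Replicating the negative class changes the class proportions but not the result.

```python
from itertools import product


def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC: the fraction of positive/negative pairs
    in which the positive instance has the higher score; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: three positives and three negatives.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
balanced_auc = auc(scores, labels)

# Make the data imbalanced by repeating every negative five times.
# The class prior changes from 1:1 to 1:5, but the pairwise ranking does not.
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
skewed_scores = scores + neg_scores * 4
skewed_labels = labels + [0] * (len(neg_scores) * 4)
skewed_auc = auc(skewed_scores, skewed_labels)

print(balanced_auc, skewed_auc)
```

Accuracy measured at a fixed threshold would generally shift under the same replication, which is one reason a threshold-free measure suits a comparison of re-sampling strategies.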
