Journal: Applied System Innovation

SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

Abstract

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting the underrepresented instances. To solve this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique available to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values, and the difference between two such numeric values reflects the amount of change in association with the minority class. Our experiments show that classification models using SMOTE-ENC offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets that contain both continuous and nominal features, and it cannot function if all the features of the dataset are nominal. Our method has been generalized to apply to both mixed datasets and nominal-only datasets.
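The abstract does not give the encoding formula, so the following Python sketch is only an illustration of the general idea: each nominal category is mapped to a number that grows with its association with the minority class. The function name `encode_nominal`, the toy data, and the specific score (the relative deviation of the observed minority count from the count expected under no association) are all assumptions made here for illustration, not necessarily the paper's definition.

```python
from collections import Counter


def encode_nominal(values, labels, minority_label):
    """Map each category to (observed - expected) / expected minority count.

    Hypothetical helper: the score is an illustrative choice, not taken
    from the abstract.
    """
    n = len(values)
    minority_total = sum(1 for y in labels if y == minority_label)
    if minority_total == 0:
        raise ValueError("no minority samples found")
    minority_rate = minority_total / n

    totals = Counter(values)
    minority_counts = Counter(
        v for v, y in zip(values, labels) if y == minority_label
    )

    encoding = {}
    for category, total in totals.items():
        # Minority count expected if the category were unrelated to the class.
        expected = total * minority_rate
        observed = minority_counts.get(category, 0)
        encoding[category] = (observed - expected) / expected
    return encoding


# Toy data: "red" is over-represented in the minority class (label 1).
colors = ["red", "red", "red", "red", "blue", "blue", "blue", "blue", "blue", "blue"]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
encoding = encode_nominal(colors, labels, minority_label=1)
```

With this score, a category that appears among minority samples more often than expected receives a positive value and one that appears less often receives a negative value, so the numeric gap between two categories tracks how differently they associate with the minority class. Once encoded, the nominal column can be handled like a continuous one in distance computations, which is why an approach of this kind would also remain usable when every feature is nominal.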
