Conference: International Conference on Science in Information Technology

Improving Imbalanced Dataset Classification Using Oversampling and Gradient Boosting



Abstract

Imbalanced data classification is a challenging task for many real-world datasets. One technique for enlarging the minority class is oversampling, which raises its sample count to match that of the majority class. This research tests SMOTE, Borderline-SMOTE, and ADASYN for handling dataset imbalance and observes their impact on classification accuracy. Gradient Boosting is applied as the classifier, and seven datasets are used. Accuracy, recall, precision, F1-score, and AUC are used to measure classifier performance. Experiments showed that oversampling increased accuracy by 2% to 11% on the Mammography, Liver Disorders, Diabetes (Pima Indian), Indian Liver, Haberman, and Immunotherapy datasets. Borderline-SMOTE yielded a larger accuracy gain than the other oversampling methods. Surprisingly, accuracy on Breast Cancer Wisconsin stayed steady with or without oversampling. Although oversampling is beneficial for imbalanced data, the sensitivity of the oversampling algorithm and the nature of the dataset must be considered.
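To make the oversampling idea concrete, the following is a minimal sketch of the basic SMOTE interpolation step: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. It is not the authors' implementation, and it does not cover Borderline-SMOTE or ADASYN. The function name `smote_oversample`, its parameters, and the toy data are all hypothetical, chosen only for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; assumes numeric feature vectors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: 8 majority points, 3 minority points.
majority = [[float(x), float(y)] for x in range(4) for y in range(2)]
minority = [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]

# Oversample the minority class until it matches the majority class size.
new_points = smote_oversample(minority, len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
```

In practice, the balanced training set would then be passed to a gradient boosting classifier, with oversampling applied to the training split only so that the test data keeps its original class distribution.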
