Published in: International Conference on Science in Information Technology

Improving Imbalanced Dataset Classification Using Oversampling and Gradient Boosting


Abstract

Imbalanced data classification is a challenging task for many real-world datasets. One technique for enlarging the minority class is oversampling, which brings it to the same size as the majority class. This research tests SMOTE, Borderline-SMOTE, and ADASYN for handling dataset imbalance and observes their impact on classification accuracy. Gradient Boosting is applied as the classifier, and seven datasets are used. Accuracy, recall, precision, F1-score, and AUC are used to measure classifier performance. Experiments showed that oversampling increased accuracy by 2% to 11% on the Mammography, Liver Disorders, Diabetes (Pima Indian), Indian Liver, Haberman, and Immunotherapy datasets. Borderline-SMOTE yielded a larger accuracy gain than the other oversampling methods. Surprisingly, accuracy on Breast Cancer Wisconsin stayed steady with or without oversampling. Even though oversampling is beneficial for imbalanced data, the sensitivity of the oversampling algorithm and the nature of the dataset must be considered.
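
To make the oversampling idea in the abstract concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation. The function names (`smote_oversample`, `balance`), the toy data, the two-class assumption, and the choice of `k=3` are all illustrative assumptions. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours, until the minority class matches the majority class in size.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points built by SMOTE-style interpolation.

    Hypothetical helper for illustration: each point lies between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def balance(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class up to the majority size.

    Assumes exactly two classes, so every other sample is majority.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    new_points = smote_oversample(minority, n_majority - len(minority), k, seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy imbalanced data: 8 majority samples (label 0), 3 minority samples (label 1)
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
     (0.2, 0.4), (0.4, 0.2), (0.5, 0.1), (0.1, 0.5),
     (2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]
y = [0] * 8 + [1] * 3

X_bal, y_bal = balance(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))
```

In practice one would more likely use a maintained library: imbalanced-learn provides `SMOTE`, `BorderlineSMOTE`, and `ADASYN`, and scikit-learn provides `GradientBoostingClassifier`. Oversampling should be applied only to the training split, so that the evaluation metrics are computed on untouched data.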
