Published in: International Conference on Software, Knowledge Information Management and Applications

Cluster-based under-sampling with random forest for multi-class imbalanced classification


Abstract

Multi-class imbalanced classification has emerged as a very challenging research area in machine learning for data mining applications. It occurs when the number of training instances representing the majority classes is much higher than the number of minority class instances. Existing machine learning algorithms achieve good accuracy when classifying majority class instances but ignore or misclassify minority class instances. However, the minority class instances often hold the most vital information, and misclassifying them can lead to serious problems. Several sampling techniques combined with ensemble learning have been proposed for binary-class imbalanced classification over the last decade. In this paper, we propose a new ensemble learning technique that employs cluster-based under-sampling with the random forest algorithm to handle highly imbalanced multi-class data classification. The proposed approach clusters the majority class instances and then selects the most informative majority class instances in each cluster to form several balanced datasets. The random forest algorithm is then applied to each balanced dataset, and a majority voting technique classifies test/new instances. We compared the performance of the proposed method with popular sampling-with-boosting methods, namely AdaBoost, RUSBoost, and SMOTEBoost, on 13 benchmark imbalanced datasets. The experimental results show that the proposed cluster-based under-sampling with random forest achieves high accuracy in classifying both majority and minority class instances compared with existing methods.
