International Conference on Software, Knowledge, Information Management and Applications

Cluster-based under-sampling with random forest for multi-class imbalanced classification



Abstract

Multi-class imbalanced classification has emerged as a very challenging research area in machine learning for data mining applications. It occurs when the number of training instances in the majority classes is much higher than that in the minority classes. Existing machine learning algorithms achieve good accuracy when classifying majority class instances, but tend to ignore or misclassify minority class instances. However, the minority class instances often hold the most vital information, and misclassifying them can lead to serious problems. Several sampling techniques combined with ensemble learning have been proposed for binary-class imbalanced classification over the last decade. In this paper, we propose a new ensemble learning technique that employs cluster-based under-sampling with the random forest algorithm to handle highly imbalanced multi-class data. The proposed approach clusters the majority class instances and then selects the most informative majority class instances in each cluster to form several balanced datasets. A random forest is then trained on each balanced dataset, and majority voting is applied to classify new/test instances. We compared the performance of the proposed method with popular sampling-with-boosting methods, namely AdaBoost, RUSBoost, and SMOTEBoost, on 13 benchmark imbalanced datasets. The experimental results show that the proposed cluster-based under-sampling with random forest achieves high accuracy in classifying both majority and minority class instances compared with the existing methods.
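Below is a minimal sketch of the workflow described in the abstract, assuming k-means for clustering the majority class, "most informative" interpreted as the instances nearest each cluster centroid, and scikit-learn's RandomForestClassifier as the base learner; the cluster count, the number of balanced subsets, and the selection rule are illustrative assumptions, not the authors' exact design.

```python
# Sketch: cluster-based under-sampling + random forest ensemble with majority voting.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def fit_ensemble(X, y, n_clusters=10, n_subsets=5, random_state=0):
    """Build one random forest per balanced subset obtained by
    cluster-based under-sampling of the largest (majority) class."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)   # treat the largest class as the majority class
    minority_size = min(counts.values())     # target size for the under-sampled majority

    maj_mask = (y == majority)
    X_maj, y_maj = X[maj_mask], y[maj_mask]
    X_rest, y_rest = X[~maj_mask], y[~maj_mask]

    # Cluster the majority-class instances.
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X_maj)
    dist = km.transform(X_maj)               # distance of each majority instance to every centroid

    rng = np.random.default_rng(random_state)
    per_cluster = max(1, minority_size // n_clusters)
    forests = []
    for s in range(n_subsets):
        picked = []
        for c in range(n_clusters):
            members = np.where(km.labels_ == c)[0]
            if members.size == 0:
                continue
            # "Most informative" here = closest to the cluster centroid (an assumption),
            # with a small random offset so the subsets differ from each other.
            order = members[np.argsort(dist[members, c])]
            picked.extend(order[: per_cluster + int(rng.integers(0, 3))])
        X_bal = np.vstack([X_maj[picked], X_rest])
        y_bal = np.concatenate([y_maj[picked], y_rest])
        rf = RandomForestClassifier(n_estimators=100, random_state=random_state + s)
        forests.append(rf.fit(X_bal, y_bal))
    return forests

def predict(forests, X_test):
    """Classify test instances by majority voting over the forests."""
    votes = np.stack([f.predict(X_test) for f in forests])   # shape: (n_subsets, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

Usage would look like `forests = fit_ensemble(X_train, y_train)` followed by `y_pred = predict(forests, X_test)`; training one forest per balanced subset and voting across them is what lets the ensemble cover the majority class without drowning out the minority classes.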
