International Conference of Computer and Information Technology

A Hybrid Under-Sampling Method (HUSBoost) to Classify Imbalanced Data



Abstract

Imbalanced learning is the problem of learning from data whose class distribution is highly skewed. Class imbalance problems appear increasingly across many domains and pose a challenge to traditional classification techniques, and learning from imbalanced data (with two or more classes) introduces additional complexity. Studies suggest that ensemble methods can produce more accurate results than standard imbalance-learning techniques such as sampling and cost-sensitive learning. To address this problem, we propose a new hybrid under-sampling-based ensemble approach (HUSBoost) for handling imbalanced data, which comprises three basic steps: data cleaning, data balancing, and classification. First, we remove noisy data using Tomek-Links. We then create several balanced subsets by applying random under-sampling (RUS) to the majority-class instances. These under-sampled majority-class instances, together with the minority-class instances, constitute the subsets of the imbalanced data set; since each subset contains equal numbers of majority- and minority-class instances, each is balanced. In each balanced subset, a random forest (RF), AdaBoost with a decision tree (CART), and AdaBoost with a support vector machine (SVM) are run in parallel, and their outputs are combined by soft voting. The final prediction averages the results of these ensemble classifiers over all balanced subsets. We use 27 data sets with different imbalance ratios to verify the effectiveness of the proposed model and compare its experimental results with the RUSBoost and EasyEnsemble methods.
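The balancing and classification steps of the abstract can be sketched with scikit-learn. This is a minimal illustration under assumptions of our own: the synthetic dataset, all hyperparameters, and the number of subsets are illustrative rather than taken from the paper, and the Tomek-Links cleaning step is elided (a library such as imbalanced-learn provides `TomekLinks` for it).

```python
# Sketch of a HUSBoost-style pipeline: random under-sampling into balanced
# subsets, three classifiers per subset, soft-vote probabilities averaged
# over all subsets. Dataset and hyperparameters are illustrative only; the
# Tomek-Links data-cleaning step from the paper is omitted here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

maj_idx = np.flatnonzero(y_tr == 0)          # majority class
min_idx = np.flatnonzero(y_tr == 1)          # minority class
n_subsets = len(maj_idx) // len(min_idx)     # enough subsets to cover the majority

probas = []
for _ in range(n_subsets):
    # Random under-sampling: draw |minority| majority instances, then
    # combine with all minority instances to form one balanced subset.
    sample = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([sample, min_idx])
    Xb, yb = X_tr[idx], y_tr[idx]
    # Three classifiers per subset: RF, AdaBoost+CART, AdaBoost+SVM.
    for clf in (
        RandomForestClassifier(n_estimators=50, random_state=0),
        AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), random_state=0),
        AdaBoostClassifier(SVC(probability=True), n_estimators=10, random_state=0),
    ):
        clf.fit(Xb, yb)
        probas.append(clf.predict_proba(X_te)[:, 1])

# Soft voting: average the predicted probabilities across every classifier
# in every balanced subset, then threshold at 0.5.
y_pred = (np.mean(probas, axis=0) >= 0.5).astype(int)
```

Averaging probabilities rather than hard labels lets a confident minority-class vote from one subset outweigh marginal majority-class votes from others, which is the usual motivation for soft voting in this setting.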
