Journal: 《计算机研究与发展》 (Journal of Computer Research and Development)

An Under-Sampling Method Based on Sample Weight for Imbalanced Data

         

Abstract

Imbalanced data is widespread in the real world, and its classification is an active research topic in data mining and machine learning. Under-sampling is a common way to handle imbalanced data sets: it selects a subset of the majority class so that the class distribution becomes balanced. However, it can discard useful majority-class information. To address this, an under-sampling method based on sample weight, named KAcBag (K-means AdaCost bagging), is proposed. The method introduces a sample weight that reflects the region in which a sample lies. First, each sample's weight is initialized according to the number of samples in each class and is then modified through repeated clustering of the data set; majority-class samples with small weights lie in the central region of the majority class. Next, the majority class is under-sampled according to these weights, so that samples in the central region are more likely to be drawn. The sampled majority-class samples are combined with all minority-class samples to form the training set of a bagging component classifier, which yields several decision-tree sub-classifiers. Finally, the prediction model is built by weighted voting according to the accuracy of each sub-classifier. Experiments on 19 UCI data sets and on customer handset-replacement data from a telecom operator show that KAcBag selects more representative samples, effectively improves classification performance on the minority class, and reduces the scale of the problem.
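The sampling and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight-update rule or the exact sampling distribution, so the sketch takes the per-sample weights as given and assumes that the probability of drawing a majority-class sample is proportional to the inverse of its weight, and that each bag draws as many majority-class samples as there are minority-class samples. The function names `undersample_by_weight`, `build_training_sets`, and `weighted_vote` are hypothetical.

```python
import random


def undersample_by_weight(majority, weights, n_draw, rng):
    """Draw n_draw distinct majority samples, favouring small weights.

    Assumption (not stated in the abstract): the selection probability
    is proportional to 1 / weight, so low-weight (central) samples are
    drawn more easily. Weights must be positive and n_draw must not
    exceed len(majority).
    """
    pool = list(range(len(majority)))
    inverse = [1.0 / weights[i] for i in pool]
    chosen = []
    for _ in range(n_draw):
        # Weighted draw without replacement: pick a position, then remove it.
        pos = rng.choices(range(len(pool)), weights=inverse, k=1)[0]
        chosen.append(majority[pool.pop(pos)])
        inverse.pop(pos)
    return chosen


def build_training_sets(majority, minority, weights, n_bags, seed=0):
    """Build one balanced training set per bagging component classifier.

    Each set holds the sampled majority samples (label 0) plus all
    minority samples (label 1). Drawing len(minority) majority samples
    per bag is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        drawn = undersample_by_weight(majority, weights, len(minority), rng)
        bags.append([(x, 0) for x in drawn] + [(x, 1) for x in minority])
    return bags


def weighted_vote(predictions, accuracies):
    """Combine sub-classifier predictions, weighting each vote by accuracy."""
    score = {}
    for label, accuracy in zip(predictions, accuracies):
        score[label] = score.get(label, 0.0) + accuracy
    return max(score, key=score.get)
```

In a fuller pipeline, each training set would be used to fit one decision tree, and the accuracy of each tree would supply the vote weights passed to `weighted_vote`.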
