Journal: Computer Engineering and Applications (计算机工程与应用)

Research on Text Classification Methods for Imbalanced Data Sets

     

Abstract

Class imbalance is common in real-world text classification. This paper proposes a combined method for text classification on imbalanced data sets that addresses the problem from two directions: optimizing feature selection and improving classifier performance. For feature selection, the traditional CHI statistic is improved by jointly considering the positive and negative correlation between feature terms and categories and the class discrimination strength of each term. At the data level, data resampling is applied to the imbalanced training corpus to reduce the effect of the imbalance on classification performance. Experimental results show that the method achieves good classification performance on text from imbalanced data sets.
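The abstract does not give the improved CHI formula or name the resampling scheme, so the following is only a minimal sketch of the two ingredients it mentions, under stated assumptions. `chi_square` is the standard CHI statistic computed from a term/class contingency table; `signed_chi_square` is one hypothetical way to keep positive and negative correlation apart (using the sign of AD - BC), not necessarily the authors' formulation; and `random_oversample` assumes plain random oversampling as the resampling step. All function names are illustrative.

```python
import random
from collections import Counter


def chi_square(a, b, c, d):
    """Standard CHI statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - b * c) ** 2 / denom


def signed_chi_square(a, b, c, d):
    """Assumption: attach the sign of (a*d - b*c) so that terms negatively
    correlated with the class are distinguishable from positive ones."""
    value = chi_square(a, b, c, d)
    return value if a * d - b * c >= 0 else -value


def random_oversample(docs, labels, seed=0):
    """Assumption: balance classes by duplicating randomly chosen
    minority-class documents until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for label, count in counts.items():
        pool = [doc for doc, lab in zip(docs, labels) if lab == label]
        for _ in range(target - count):
            out_docs.append(rng.choice(pool))
            out_labels.append(label)
    return out_docs, out_labels
```

With the signed variant, a term that appears only outside a class scores as strongly negative rather than being ranked alongside terms that characterize the class, which is the distinction the unsigned statistic loses.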
