首页> 中文期刊>计算机应用研究 >不平衡数据集上的文本分类特征选择新方法

不平衡数据集上的文本分类特征选择新方法

     

摘要

针对不平衡数据集上进行文本分类,传统的特征选择方法容易导致分类器倾向于大类而忽视小类,提出一种新的特征选择方法IPR(integrated probability ratio).该方法综合考虑特征在正类和负类中的分布性质,结合四种衡量特征类别相关性的指标对特征词进行评分,能够更好地解决传统特征选择方法在不平衡数据集上的不适应性,在不降低大类分类性能的同时提高了小类的识别率.实验结果表明,该方法有效可行.%Handing unbalanced data sets in text classification, the traditional feature selection approach more likely tends to large categories and neglects sub-categories. To tackle this problem, this paper proposed a new feature selection approach IPR. This approach considered the distribution property of feature between the positive class and negative class, combined four measure indicators for features with categories distinguishing ability, this approach had solved the problem which traditional fea-ture selection was not adaptive to unbalanced data set and improving the recognition rate of sub-categories,but hadn' t reduced performance of the large categories. Experimental result shows that it is an effective and feasible feature selection approach.

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号