Journal: 《传感器与微系统》

Sentiment Text Classification Based on Improved CHI Feature Selection

Abstract

To improve the accuracy of sentiment text classification, different preprocessing methods for English sentiment text are studied, and an improved chi-square statistic (CHI) feature extraction algorithm is proposed. CHI is an effective feature selection method, but analysis shows that it has two weaknesses: a negative correlation phenomenon and a tendency to select low-frequency feature words. To overcome these shortcomings, the improved algorithm takes word frequency, concentration, and dispersion into account and, considering the uneven lengths of texts and the distribution of feature words, normalizes word frequency. Experiments with the classical naive Bayes and support vector machine (SVM) classification algorithms on balanced, imbalanced, and mixed-length corpora show that the new method improves the accuracy of sentiment text classification.
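The abstract does not give the improved formula, so the following is only a minimal sketch of the standard (unmodified) CHI score, written to illustrate the negative correlation weakness mentioned above. The function names `chi_square` and `is_negatively_correlated` and the example counts are assumptions made for illustration; they are not taken from the paper.

```python
def chi_square(a, b, c, d):
    """Standard CHI score of a term for a class, from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def is_negatively_correlated(a, b, c, d):
    """True when the term is rarer inside the class than outside it."""
    return a * d - b * c < 0


# Hypothetical counts: 50 documents in the class, 50 outside it.
frequent_in_class = (40, 10, 10, 40)  # term appears mostly inside the class
rare_in_class = (10, 40, 40, 10)      # term appears mostly outside the class

print(chi_square(*frequent_in_class), is_negatively_correlated(*frequent_in_class))
print(chi_square(*rare_in_class), is_negatively_correlated(*rare_in_class))
```

Because the difference `a * d - b * c` is squared, both terms receive the same score even though the second one is mostly absent from the class. The score also uses document counts only, so how often a term occurs within a document is ignored, which is consistent with the low-frequency bias the abstract describes.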
