为了提高情感文本分类的准确率,对英文情感文本不同的预处理方式进行了研究,同时提出了一种改进的卡方统计量(CHI)特征提取算法.卡方统计量是一种有效的特征选择方法,但分析发现存在负相关现象和倾向于选择低频特征词的问题.为了克服不足之处,在考虑到词频、集中度和分散度等因素的基础上,考虑文本的长短不均衡和特征词分布,对词频进行归一化,提出了一种改进的卡方统计量特征提取算法.利用经典朴素贝叶斯和支持向量机分类算法在均衡语料、非均衡语料和混合长短文本语料上实验,实验结果表明:新的方法提高了情感文本分类的准确率.%In order to improve the accuracy of sentiment text classification,different preprocessing methods of the sentiment of English text is studied,and an improved algorithm of Chi-square statistic (CHI) feature extraction is put forward.CHI is one of the most efficient feature selection methods,but there are two weaknesses,negative correlation phenomenon and tend to choose low-frequency feature words.In order to overcome these two shortcomings,on the basis of taking into account factors of word frequency,concentration information and dispersion information,considering the length of the text is not balanced and the distribution of feature words,word frequency is normalized,CHI feature extraction algorithm is proposed.Using classical naive Bayes and support vector machine(SVM) classification algorithms experiments is carried out on balanced corpus,imbalanced corpus and mixed-length corpus,and experimental results show that the new method improves accuracy of sentiment text classification.
展开▼