n-grams作为文本分类特征时易造成分类准确率下降,并且在对n-grams加权时通常忽略单词间的冗余度和相关性.针对上述问题,文中提出基于相关性及语义的n-grams特征加权算法.在文本预处理时,对n-grams进行特征约简,降低内部冗余,再根据n-grams内单词与类别的相关性及n-grams与测试集的语义近似度加权.搜狗中文新闻语料库和网易文本分类语料库上的实验表明,文中算法能筛选高类别相关且低冗余的n-grams特征,在量化测试集时减少稀疏数据的产生.%When n-grams are considered as text classification features, the classification accuracy is decreased. The redundancy and relevance between words are ignored while n-grams are weighted. Thus, n-grams features weighting algorithm based on relevance and semantic is proposed. To decrease the internal redundancy, feature reduction is conducted to n-grams during text preprocessing. Then, n-grams are weighted according to the relevance of words and classes in n-grams and the semantic similarity of n-grams and testing dataset. The experimental results on Sougo Chinese news corpse and NetEase text corpse show that the proposed algorithm can select n-grams features of high relevance and low redundancy, and reduce the sparse data while quantifying the testing dataset.
展开▼