The classical term frequency-inverse document frequency (TFIDF) algorithm neglects how terms are distributed within and between the categories of a text collection. To address this problem, this paper introduces information entropy to improve the TFIDF algorithm based on information gain (TFIDFIG), and proposes a TFIDF algorithm based on information gain and information entropy (TFIDFIGE). Experimental results show that TFIDFIGE achieves higher precision and recall than the traditional TFIDF and TFIDFIG algorithms.
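The abstract does not state the TFIDFIGE weighting formula, so the following is only a minimal sketch of two ingredients it names: the classical TFIDF weight and the information entropy of a term's distribution across categories. The function names `tf_idf` and `category_entropy`, the toy corpus, and the choice of an unsmoothed `log(N / df)` are assumptions made for illustration, not the paper's definitions.

```python
import math
from collections import Counter


def tf_idf(term, doc, docs):
    """Classical TFIDF: relative term frequency in `doc` times log(N / df).

    Assumption: unsmoothed IDF with the natural logarithm.
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * math.log(len(docs) / df) if df else 0.0


def category_entropy(term, docs, labels):
    """Shannon entropy (in bits) of the term's occurrences across categories.

    Low entropy: the term is concentrated in few categories.
    High entropy: the term is spread evenly across categories.
    """
    counts = Counter()
    for d, label in zip(docs, labels):
        counts[label] += d.count(term)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


# Hypothetical toy corpus: tokenized documents with category labels.
docs = [["ball", "goal", "team"], ["ball", "match"],
        ["vote", "law"], ["vote", "team"]]
labels = ["sport", "sport", "politics", "politics"]

ball_weight = tf_idf("ball", docs[0], docs)
ball_entropy = category_entropy("ball", docs, labels)  # only in "sport"
team_entropy = category_entropy("team", docs, labels)  # split across both
```

In this toy corpus, "ball" occurs only in the sport category while "team" is split evenly between the two categories, which is the kind of between-category distribution that plain TFIDF ignores. How the paper combines such an entropy term with information gain is not given in the abstract.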