Feature selection plays an important role in text classification and directly affects classification accuracy. The mutual information (MI)-based feature selection method has known defects: it tends to select rare words and words drawn from small samples as features, and it can produce negative MI values. To correct these defects, this paper proposes an improved feature evaluation function for automatic text classification that jointly considers word frequency, concentration between classes, and dispersion within a class. Experimental results show that the improved algorithm remedies the original MI evaluation function's tendency to select rare words and significantly improves classification performance.
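
The rare-word bias and the negative values mentioned above can be illustrated with the standard pointwise MI score, MI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts. The following is a minimal sketch only: the function name, the count parameters, and the toy corpus figures are assumptions made for illustration, and it does not reproduce the improved evaluation function proposed in the paper.

```python
import math


def mutual_information(n_tc, n_t, n_c, n):
    """Pointwise MI (in bits) between term t and class c, from document counts.

    n_tc: documents in class c that contain t
    n_t:  documents that contain t
    n_c:  documents in class c
    n:    total number of documents
    All names are hypothetical; they are not taken from the paper.
    """
    if n_tc == 0:
        # log(0) is undefined; treat a term never seen in the class as -inf.
        return float("-inf")
    # P(t,c) / (P(t) * P(c)) simplifies to (n_tc * n) / (n_t * n_c).
    return math.log2((n_tc * n) / (n_t * n_c))


# Toy corpus (assumed numbers): 1000 documents, 100 of them in class c.
# A word that occurs in a single document, which happens to be in class c.
rare = mutual_information(n_tc=1, n_t=1, n_c=100, n=1000)
# A word that occurs in 100 documents, 80 of them in class c.
common = mutual_information(n_tc=80, n_t=100, n_c=100, n=1000)
# A word that occurs in 100 documents, only 2 of them in class c.
negative = mutual_information(n_tc=2, n_t=100, n_c=100, n=1000)

print(f"rare word:     {rare:.2f}")
print(f"common word:   {common:.2f}")
print(f"negative case: {negative:.2f}")
```

In this toy setting the word seen in only one document scores higher than the word that covers 80 of the 100 documents in the class, and a word that is under-represented in the class receives a negative score. These are the behaviors that a function weighting word frequency, between-class concentration, and within-class dispersion is meant to counteract.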