Text categorization usually suffers from a very large number of features, most of which are irrelevant or noisy and can mislead the classifier. Feature selection is therefore often performed to improve both the efficiency and the effectiveness of text categorization. In this paper, we propose a novel feature selection approach for text categorization, called Maximum Information Metric (MIM), for selecting high-quality terms from documents. The method uses term weight and document frequency to model the correlation between a term and each class and, based on information theory, aims to maximize the differences of a term's distribution across classes. We design an improved evaluation function that produces a ranking of the features. Experimental results on the standard Reuters-21578 and 20-Newsgroups corpora show that the new approach outperforms classic feature selection methods, including Information Gain (IG) and the Chi-square statistic (CHI), in text categorization.
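For orientation, the two baselines named above have standard textbook definitions, which can be sketched in a few lines of Python. This is only an illustrative sketch of IG and CHI on binary term presence; it does not reproduce MIM, whose evaluation function is not given in this abstract. The function names (`chi_square`, `information_gain`, `rank_terms`) and the toy documents are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of a term/class 2x2 contingency table.

    a: in-class docs containing the term, b: out-of-class docs containing it,
    c: in-class docs without the term, d: out-of-class docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t)."""
    classes = sorted(set(labels))
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_t = sum(with_t.values())
    h_c = entropy([with_t[c] + without_t[c] for c in classes])
    h_t = entropy([with_t[c] for c in classes])
    h_not_t = entropy([without_t[c] for c in classes])
    return h_c - (n_t / n) * h_t - ((n - n_t) / n) * h_not_t


def rank_terms(docs, labels):
    """Rank the vocabulary by IG, highest first; ties broken alphabetically."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"oil", "price"}, {"oil", "export"}, {"match", "goal"}, {"goal", "price"}]
labels = ["energy", "energy", "sport", "sport"]

ranking = rank_terms(docs, labels)
print(ranking)
```

In this toy corpus, terms that occur in only one class ("oil", "goal") rank highest, while "price", which is spread evenly over both classes, carries no information about the class and ranks last.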