Conference: Information Science and Engineering (ICISE), 2009

Feature Selection with Maximum Information Metric in Text Categorization

Abstract

Text categorization usually suffers from a very large number of features, most of which are irrelevant or noisy and can mislead the classifier. Feature selection is therefore often performed to improve both the efficiency and the effectiveness of text categorization. In this paper, a novel feature selection approach for text categorization, called Maximum Information Metric (MIM), is proposed to select high-quality terms from documents. The method uses term weight and document frequency to construct the correlation between a term and each class, and aims to maximize the differences of a term across classes on the basis of information theory. We design an improved evaluation function that yields a ranking of the features. Experimental results on the standard Reuters-21578 and 20-Newsgroups corpora show that the new feature selection approach outperforms classic methods, including Information Gain (IG) and the Chi-square statistic (CHI), in text categorization.
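The abstract does not give the MIM evaluation function, so it is not reproduced here. As a point of reference only, the following minimal sketch shows how the two baselines named above, IG and CHI, can rank terms from per-class document frequencies. It uses the standard textbook forms of both scores; the function names, the toy corpus, and the choice of taking the maximum CHI over classes are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def term_class_counts(docs, labels):
    """Count, for every term, how many documents of each class contain it."""
    df = {}
    class_sizes = Counter(labels)
    for tokens, cls in zip(docs, labels):
        for term in set(tokens):  # document frequency: count a term once per doc
            df.setdefault(term, Counter())[cls] += 1
    return df, class_sizes


def chi_square(a, b, c, d):
    """Chi-square of a 2x2 table: a/b = docs with the term inside/outside the
    class, c/d = docs without the term inside/outside the class."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k)


def score_terms(docs, labels):
    """Return {term: {"ig": ..., "chi": ...}} for baseline feature ranking."""
    df, class_sizes = term_class_counts(docs, labels)
    n = len(docs)
    classes = sorted(class_sizes)
    h_class = entropy([class_sizes[c] for c in classes])
    scores = {}
    for term, per_class in df.items():
        n_t = sum(per_class.values())  # docs containing the term
        with_t = [per_class[c] for c in classes]
        without_t = [class_sizes[c] - per_class[c] for c in classes]
        # IG = H(C) - P(t) H(C|t) - P(not t) H(C|not t)
        ig = (h_class
              - (n_t / n) * entropy(with_t)
              - ((n - n_t) / n) * entropy(without_t))
        # Assumption: aggregate per-class CHI scores by their maximum.
        chi = max(
            chi_square(per_class[c],
                       n_t - per_class[c],
                       class_sizes[c] - per_class[c],
                       n - n_t - class_sizes[c] + per_class[c])
            for c in classes
        )
        scores[term] = {"ig": ig, "chi": chi}
    return scores


# Hypothetical toy corpus, for illustration only.
toy_docs = [["wheat", "price"], ["wheat", "export"],
            ["goal", "match"], ["goal", "price"]]
toy_labels = ["grain", "grain", "sport", "sport"]
ranked = sorted(score_terms(toy_docs, toy_labels).items(),
                key=lambda kv: kv[1]["ig"], reverse=True)
for term, s in ranked:
    print(f"{term:8s} IG={s['ig']:.3f} CHI={s['chi']:.3f}")
```

In this toy corpus a term that appears only in one class ("wheat") receives a high score under both measures, while a term spread evenly across classes ("price") scores zero. Any feature selection method of this kind keeps the top-ranked terms and discards the rest before training the classifier.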
