Conference: 2012 Fourth International Symposium on Information Science and Engineering

A Comparative Study on Feature Selection in Unbalance Text Classification


Abstract

Feature selection plays an important role in text classification. Unbalanced text classification is a special classification problem that arises widely in practice. However, which feature selection method is most effective for unbalanced text classification? To our knowledge, there has been no systematic study of these feature selection methods on unbalanced text classification. This paper presents a comparative study of feature selection methods for this problem, with a focus on aggressive dimensionality reduction. We run our experiments on both Chinese and English corpora. Seven methods were evaluated: term selection based on document frequency (DF), information gain (IG), the chi-square (CHI) feature selection method, mutual information (MI), expected cross entropy (ECE), the weight of evidence for text (WET), and odds ratio (ODD). We found ODD and WET most effective in the two-class classification task; in contrast, IG and CHI performed relatively poorly because of their bias towards rare terms and their sensitivity to probability estimation errors. In the multi-class task, however, IG and CHI performed better, while MI performed poorly.
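To make two of the compared criteria concrete, the following is a minimal sketch of how odds ratio and chi-square scores could be computed for terms in a two-class, unbalanced toy corpus. It uses the common textbook forms of both statistics, which may differ from the exact definitions used in the paper. The function names, the toy documents, and the additive smoothing in `odds_ratio` are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_counts(docs, labels, positive):
    """Per-term document frequency in the positive and negative class."""
    pos_df, neg_df = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, label in zip(docs, labels):
        if label == positive:
            n_pos += 1
            pos_df.update(set(tokens))  # count each term once per document
        else:
            n_neg += 1
            neg_df.update(set(tokens))
    return pos_df, neg_df, n_pos, n_neg


def odds_ratio(a, b, n_pos, n_neg):
    """Log odds ratio; a and b are positive/negative docs containing the term.

    The +0.5 / +1.0 smoothing is an assumption, added so that the
    estimated probabilities never reach 0 or 1.
    """
    p_pos = (a + 0.5) / (n_pos + 1.0)
    p_neg = (b + 0.5) / (n_neg + 1.0)
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))


def chi_square(a, b, n_pos, n_neg):
    """Chi-square statistic from the 2x2 term/class contingency table."""
    c = n_pos - a  # positive docs without the term
    d = n_neg - b  # negative docs without the term
    n = n_pos + n_neg
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def rank_terms(docs, labels, positive, score):
    """Return (term, score) pairs sorted from highest to lowest score."""
    pos_df, neg_df, n_pos, n_neg = term_counts(docs, labels, positive)
    vocab = set(pos_df) | set(neg_df)
    scored = [(t, score(pos_df[t], neg_df[t], n_pos, n_neg)) for t in vocab]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical unbalanced corpus: one "spam" document, four "ham" documents.
docs = [
    ["cheap", "pills"],
    ["meeting", "today"],
    ["meeting", "notes"],
    ["lunch", "today"],
    ["cheap", "lunch"],
]
labels = ["spam", "ham", "ham", "ham", "ham"]

or_ranking = rank_terms(docs, labels, "spam", odds_ratio)
chi_ranking = rank_terms(docs, labels, "spam", chi_square)
print(or_ranking[:3])
print(chi_ranking[:3])
```

Aggressive dimensionality reduction then amounts to keeping only the top-scoring terms from such a ranking. Note that this form of odds ratio scores terms with respect to a single target class, whereas chi-square is symmetric in the two classes.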
