2011 International Conference of Soft Computing and Pattern Recognition

A global evaluation criterion for feature selection in text categorization using Kullback-Leibler divergence


Abstract

A major difficulty of text categorization is the extremely high dimensionality of the text feature space. Feature selection techniques are therefore desirable for large-scale text categorization tasks, improving both accuracy and efficiency. The χ2 statistic and the simplified χ2 are two effective feature selection methods in text categorization. With these two criteria, one computes a local score for a term over each category and usually takes the maximum or the average of these scores as the global term-goodness criterion. However, there is no explicit guidance on when to choose the maximum or the average; moreover, neither operation reflects the degree to which a term is scattered over all categories. In this paper, we propose a new global feature evaluation criterion based on Kullback-Leibler (KL) divergence for choosing informative terms, since KL divergence is a widely used measure of the difference between two distributions. We conduct experiments on the Reuters-21578 corpus with a k-NN classifier to test the performance of the proposed method. The experimental results show that the method enhances text categorization performance: the new criterion performs comparably to or better than the previous maximum and average criteria on both Macro-F1 and Micro-F1.
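The abstract describes combining per-category local term scores into a global term-goodness criterion via KL divergence, but it does not give the exact formula. The following is only a minimal illustrative sketch, assuming the global score compares a term's distribution over categories against the overall category distribution; the function names and this particular scoring choice are assumptions, not the paper's definition.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as 1-D arrays."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_term_score(term_category_counts, category_doc_counts):
    """Hypothetical global goodness score for one term.

    term_category_counts[c]: documents in category c containing the term.
    category_doc_counts[c]:  total documents in category c.
    A term concentrated in few categories diverges more from the overall
    category distribution and therefore scores higher.
    """
    return kl_divergence(term_category_counts, category_doc_counts)

# Toy usage: 3 categories, two candidate terms.
category_sizes = [100, 80, 120]
informative_term = [60, 2, 1]    # concentrated in one category
common_term = [33, 27, 40]       # spread roughly like the corpus

print(kl_term_score(informative_term, category_sizes))  # relatively large
print(kl_term_score(common_term, category_sizes))       # close to zero
```

A scatter-sensitive score of this kind ranks terms whose occurrence pattern deviates strongly from the corpus-wide category distribution above terms that appear uniformly everywhere, which is the property the abstract says the maximum and average operations fail to capture.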
