Home > Conference Papers > International Conference of Soft Computing and Pattern Recognition > A global evaluation criterion for feature selection in text categorization using Kullback-Leibler divergence

A global evaluation criterion for feature selection in text categorization using Kullback-Leibler divergence



Abstract

A major difficulty of text categorization is the extremely high dimensionality of the text feature space. Feature selection techniques are therefore desirable in large-scale text categorization tasks to improve both accuracy and efficiency. The χ² statistic and the simplified χ² are two effective feature selection methods in text categorization. Under these two criteria, one computes a local score for a term in each category and usually takes the maximum or the average of these scores as the global term-goodness measure. However, there is no explicit guidance on when to choose the maximum or the average; moreover, neither operation reflects the degree to which a term is scattered over all categories. In this paper, we propose a new global feature evaluation criterion based on Kullback-Leibler (KL) divergence for choosing informative terms, since KL divergence is a widely used measure of the difference between two distributions. We conduct experiments on the Reuters-21578 corpus with a k-NN classifier to evaluate the proposed method. The experimental results show that the method enhances text categorization performance: it is comparable to or better than the previous maximum and average combinations on both Macro-F1 and Micro-F1.
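To illustrate the underlying idea, a term's concentration over categories can be scored by the KL divergence between the category distribution conditioned on the term, P(c|t), and the prior category distribution P(c). The sketch below is a minimal Python example of this general formulation; the function name, the inputs, and the exact definition of the score are our assumptions for illustration, not the paper's stated criterion.

```python
import math

def kl_term_score(term_counts, category_sizes):
    """Score a term by KL(P(c|t) || P(c)).

    term_counts:    dict mapping category -> number of documents in that
                    category containing the term.
    category_sizes: dict mapping category -> total number of documents.

    A term concentrated in few categories diverges strongly from the
    prior and gets a high score; a term spread like the prior scores
    near zero. Illustrative sketch only.
    """
    total_term = sum(term_counts.values())
    total_docs = sum(category_sizes.values())
    score = 0.0
    for c, size in category_sizes.items():
        p_c_given_t = term_counts.get(c, 0) / total_term
        p_c = size / total_docs
        if p_c_given_t > 0:  # 0 * log 0 is taken as 0
            score += p_c_given_t * math.log(p_c_given_t / p_c)
    return score
```

For example, with two equally sized categories, a term appearing only in one of them scores log 2, while a term split evenly between them scores 0 — the score directly reflects the scatter that a plain maximum or average of local scores ignores.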
