International Joint Conference on Neural Networks

Class-dependent feature selection algorithm for text categorization

Abstract

A common approach in text categorization is to represent each word as a feature; however, many of these features are irrelevant. Dimensionality reduction is therefore an important step to reduce computational effort and improve accuracy. This paper presents a filter method for feature selection called Category-dependent Maximum f Features per Document (cMFDR), which extends and improves on the MFDR algorithm. In MFDR, the best features are selected from documents that exceed a threshold calculated over the whole dataset under evaluation. We show that a single global threshold is not an optimal strategy, since it disregards categories that contain few relevant features and thereby impairs classification precision. cMFDR instead computes one threshold per category, so that each category contributes its own number of features. Moreover, unlike in MFDR, the threshold calculation is not biased by documents with a large number of features. The experimental evaluation shows the effectiveness of cMFDR on four text categorization benchmarks, using three feature evaluation functions and the Naïve Bayes Multinomial classifier. cMFDR obtains results better than or similar to those of MFDR in 98% of the cases.
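
The abstract does not give the exact formulas, so the following is only a minimal Python sketch of the contrast it describes between one global threshold and one threshold per category. It rests on several assumptions that are not taken from the paper: a document's relevance is the mean score of its terms (so that long documents do not dominate), a threshold is the mean relevance of the documents it covers, a document passes when its relevance is at least the threshold, and each passing document contributes its f highest-scored terms. The function name `select_features`, its parameters, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def select_features(docs, labels, scores, f=1, per_category=True):
    """Illustrative sketch only, not the authors' implementation.

    docs:   list of sets of terms, one set per document
    labels: category label for each document
    scores: term -> score from some feature evaluation function (assumed given)
    """
    # Assumption: document relevance is the MEAN score of its terms, so a
    # document with many terms does not inflate the threshold.
    def relevance(doc):
        known = [scores[t] for t in doc if t in scores]
        return sum(known) / len(known) if known else 0.0

    rel = [relevance(d) for d in docs]

    # One group per category, or a single group for the global variant.
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label if per_category else None].append(i)

    selected = set()
    for idxs in groups.values():
        # Assumption: the threshold is the mean relevance within the group.
        threshold = sum(rel[i] for i in idxs) / len(idxs)
        for i in idxs:
            if rel[i] >= threshold:
                # Take the f best-scored terms of this document
                # (ties broken alphabetically for determinism).
                ranked = sorted(
                    (t for t in docs[i] if t in scores),
                    key=lambda t: (-scores[t], t),
                )
                selected.update(ranked[:f])
    return selected


# Toy data: "politics" terms score low overall, so a global threshold
# filters out every politics document.
scores = {"goal": 0.9, "match": 0.8, "tax": 0.2, "vote": 0.1, "the": 0.05}
docs = [{"goal", "match"}, {"goal", "the"}, {"tax", "vote"}, {"vote", "the"}]
labels = ["sports", "sports", "politics", "politics"]

global_sel = select_features(docs, labels, scores, f=1, per_category=False)
category_sel = select_features(docs, labels, scores, f=1, per_category=True)
print(sorted(global_sel))    # ['goal']
print(sorted(category_sel))  # ['goal', 'tax']
```

On this toy data the global variant keeps only a sports term, while the per-category variant also keeps a politics term. That mirrors the abstract's argument that a single threshold can disregard categories whose features score low overall.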
