Journal: Artificial Intelligence Research

An empirical evaluation of text classification and feature selection methods


Abstract

An extensive empirical evaluation of classifiers and feature selection methods for text categorization is presented. More than 500 models were trained and tested using different combinations of corpora, term weighting schemes, numbers of features, feature selection methods and classifiers. The experiments used five benchmark corpora, three term weighting schemes, three feature selection methods and four classifiers. The performance measures were the micro-averaged F measure and classifier training time. Using all the features gave only a slight performance improvement over using just 20% of the features selected with Information Gain and Chi Square, and this improvement was not statistically significant. The Support Vector Machine with a linear kernel performed best on text categorization tasks, producing the highest F measures and low training times even in the presence of high class skew. We found a statistically significant difference between the performance of the Support Vector Machine and that of the other classifiers on text categorization problems.
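The abstract names two techniques that are easy to illustrate on a toy scale: Chi Square feature selection that keeps 20% of the features, and the micro-averaged F measure. The sketch below is not the authors' implementation. The function names (`chi_square`, `select_top_fraction`, `micro_f1`) and the four-document corpus are hypothetical, documents are assumed to be sets of tokens (binary term presence), and scoring a term by its maximum Chi Square value over all classes is one common convention assumed here, not something the abstract specifies.

```python
import math


def chi_square(docs, labels, term, cls):
    """Chi-square statistic of the 2x2 table: term present/absent vs. class cls/other."""
    n = len(docs)
    pairs = list(zip(docs, labels))
    n11 = sum(1 for d, y in pairs if term in d and y == cls)      # term present, in class
    n10 = sum(1 for d, y in pairs if term in d and y != cls)      # term present, other class
    n01 = sum(1 for d, y in pairs if term not in d and y == cls)  # term absent, in class
    n00 = n - n11 - n10 - n01                                     # term absent, other class
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A term that occurs in every document (or none) carries no class information.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_top_fraction(docs, labels, fraction=0.2):
    """Keep the top `fraction` of terms, scored by max chi-square over classes (assumed)."""
    vocab = sorted(set().union(*docs))
    classes = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    k = max(1, math.ceil(fraction * len(vocab)))
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus: each document is a set of tokens.
toy_docs = [
    {"ball", "goal", "the"},
    {"goal", "team", "the"},
    {"vote", "law", "the"},
    {"law", "party", "the"},
]
toy_labels = ["sport", "sport", "politics", "politics"]

selected = select_top_fraction(toy_docs, toy_labels, fraction=0.2)
score = micro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"])
```

In the single-label setting assumed here, every misclassified document counts once as a false positive and once as a false negative, so the micro-averaged F measure equals plain accuracy. Large classes therefore dominate the measure, which is worth keeping in mind when reading results on corpora with high class skew.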
