Journal: Expert Systems with Applications

Comparison of term frequency and document frequency based feature selection metrics in text categorization

Abstract

Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on either the term frequency or the document frequency of a word. We focus on the relative importance of these two frequencies for feature selection metrics. For this purpose, two document frequency based metrics, the discriminative power measure and the Gini index, were examined with term frequency. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute to their superior performance for smaller feature sets.
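To make the distinction between the two frequencies concrete, the following minimal Python sketch counts both per class and scores terms with a Gini-style metric. It is an illustration under stated assumptions, not the paper's implementation. The scoring formula (the sum over classes of P(t|c)^2 * P(c|t)^2) is one formulation used in the text feature selection literature and may differ from the definition used in the paper. The term frequency variant shown here simply swaps document counts for occurrence counts, which is an assumption. The toy corpus and the names `frequency_counts` and `gini_score` are hypothetical.

```python
from collections import Counter, defaultdict


def frequency_counts(docs):
    """Return per-class term frequency (tf) and document frequency (df).

    docs is a list of (tokens, label) pairs.
    tf[label][term] = total occurrences of term in documents of that class.
    df[label][term] = number of documents of that class containing term.
    """
    tf = defaultdict(Counter)
    df = defaultdict(Counter)
    for tokens, label in docs:
        tf[label].update(tokens)       # every occurrence counts
        df[label].update(set(tokens))  # each document counts at most once
    return tf, df


def gini_score(term, counts, class_totals):
    """Gini-style score: sum over classes of P(t|c)^2 * P(c|t)^2.

    Assumed formulation, for illustration only.
    counts[c][term] is a per-class count (document or term frequency);
    class_totals[c] is the matching normaliser (documents or tokens in c).
    """
    overall = sum(counts[c][term] for c in class_totals)
    if overall == 0:
        return 0.0
    score = 0.0
    for c, total in class_totals.items():
        p_t_given_c = counts[c][term] / total
        p_c_given_t = counts[c][term] / overall
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score


# Hypothetical toy corpus, not taken from Reuters-21578.
docs = [
    (["oil", "oil", "oil", "price"], "crude"),
    (["oil", "market"], "crude"),
    (["wheat", "price"], "grain"),
    (["wheat", "wheat", "oil"], "grain"),
]

tf, df = frequency_counts(docs)

# Normalisers: documents per class for df, tokens per class for tf.
docs_per_class = Counter(label for _, label in docs)
tokens_per_class = {label: sum(counter.values()) for label, counter in tf.items()}

vocabulary = sorted({token for tokens, _ in docs for token in tokens})
df_scores = {t: gini_score(t, df, docs_per_class) for t in vocabulary}
tf_scores = {t: gini_score(t, tf, tokens_per_class) for t in vocabulary}

# Rank terms by each variant; a smaller feature set keeps only the top few.
top_by_df = sorted(vocabulary, key=df_scores.get, reverse=True)
top_by_tf = sorted(vocabulary, key=tf_scores.get, reverse=True)
```

In this toy corpus "oil" appears in two "crude" documents but four times in total, so the two counts diverge for the same term. That divergence is the kind of difference between term frequency and document frequency that the abstract compares.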
