Journal: Engineering Applications of Artificial Intelligence

Using modified term frequency to improve term weighting for text classification


Abstract

Text classification (TC) is an essential task in natural language processing (NLP). To improve TC performance, term weighting is often used to obtain an effective text representation by assigning an appropriate weight to each term. A term weighting scheme is generally composed of a term frequency factor, a collection frequency factor, and a normalization factor; the normalization factor is commonly treated as optional and serves to offset the influence of document length. Our investigation of existing term weighting schemes shows that most of them focus on finding a more effective collection frequency factor and rarely attempt to find a new term frequency factor. In this paper, we first propose a new term frequency factor called modified term frequency (MTF). Unlike the normalization factor, MTF directly modifies the raw term frequency based on the length information of all training documents. We then propose a new term weighting scheme that combines MTF with an existing collection frequency factor called the modified distinguishing feature selector (MDFS). We denote this scheme MTF-MDFS (MDFS-based MTF). Extensive experiments on 19 benchmark text datasets and 6 real-world text datasets show that the proposed MTF and MTF-MDFS both substantially outperform their state-of-the-art competitors in terms of classification accuracy and weighted-average F_1 with widely used base classifiers such as MNB, SVM, and LR.
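
The three-factor structure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the MTF or MDFS formulas, so this example does not implement them: it uses raw term frequency, plain inverse document frequency, and cosine normalization as stand-ins. The function name `weight_documents`, its `tf_factor` parameter, and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def weight_documents(docs, tf_factor=lambda tf: tf, normalize=True):
    """Weight each term as tf_factor(tf) * idf, optionally cosine-normalized.

    Stand-in scheme for illustration only; NOT the paper's MTF-MDFS.
    """
    n_docs = len(docs)
    # Collection frequency factor: plain IDF, log(N / df), as a stand-in.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency factor: identity by default; a modified term
        # frequency would be plugged in through tf_factor.
        vec = {t: tf_factor(c) * idf[t] for t, c in counts.items()}
        if normalize:
            # Normalization factor: cosine (L2) normalization, which
            # offsets the influence of document length.
            norm = math.sqrt(sum(w * w for w in vec.values()))
            if norm > 0:
                vec = {t: w / norm for t, w in vec.items()}
        weighted.append(vec)
    return weighted


# Toy tokenized corpus (hypothetical data).
docs = [
    ["cheap", "flights", "cheap", "hotels"],
    ["football", "match", "tonight"],
    ["cheap", "football", "tickets"],
]
vectors = weight_documents(docs)
```

Passing a different `tf_factor`, for example `lambda tf: 1 + math.log(tf)`, changes only the term frequency factor and leaves the other two untouched. That is the part of the scheme the abstract says is rarely revisited.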
