Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence

Supervised and Traditional Term Weighting Methods for Automatic Text Categorization

Abstract

In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. Taking into account the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed method, tf.rf, performs consistently better than the other term weighting methods, while the other supervised methods, which are based on information theory or statistical metrics, perform the worst in all experiments. On the other hand, the popular tf.idf method does not perform uniformly well across the different data sets.
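The abstract names tf.idf and tf.rf without giving their formulas, so the contrast between an unsupervised and a supervised weighting can be illustrated with a minimal Python sketch. It assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term; that formula is not stated in the text above and should be checked against the full paper. The toy corpus, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def idf(term, doc_sets):
    """Unsupervised factor: log2(N / df). Category labels are ignored."""
    df = sum(1 for d in doc_sets if term in d)
    return math.log2(len(doc_sets) / df) if df else 0.0


def rf(term, doc_sets, labels, positive):
    """Supervised factor (assumed form): log2(2 + a / max(1, c)).

    a = positive-category documents containing the term,
    c = negative-category documents containing the term.
    """
    a = sum(1 for d, y in zip(doc_sets, labels) if term in d and y == positive)
    c = sum(1 for d, y in zip(doc_sets, labels) if term in d and y != positive)
    return math.log2(2 + a / max(1, c))


def weight_document(tokens, factor):
    """Multiply each term's raw frequency by a per-term collection factor."""
    return {t: n * factor(t) for t, n in Counter(tokens).items()}


# Hypothetical toy corpus with two categories.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "price"],
    ["bank", "rate", "price"],
    ["bank", "price", "cut"],
]
labels = ["grain", "grain", "finance", "finance"]
doc_sets = [set(d) for d in docs]

tfidf = weight_document(docs[0], lambda t: idf(t, doc_sets))
tfrf = weight_document(docs[0], lambda t: rf(t, doc_sets, labels, "grain"))

print("tf.idf:", tfidf)
print("tf.rf: ", tfrf)
```

In this toy corpus, "wheat" occurs only in the positive category and receives the largest tf.rf weight, whereas tf.idf ranks the rarer term "rise" higher because it looks only at document frequency and never at the labels.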