Published in: Computational Linguistics and Intelligent Text Processing; Lecture Notes in Computer Science; 4394

Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization


Abstract

Traditional tfidf-like term weighting schemes use a rough statistic, idf, as the term weighting factor. For a text categorization task, idf exploits neither the category information in the training data (the category labels on documents) nor the intra-document information (the relative importance of a given term to a document that contains it). We present a more elaborate nonparametric probabilistic model that uses both kinds of information in the term weighting phase, and we prove theoretically that idf is a rough approximation of the new term weighting factor. This work is preliminary and aims mainly to inspire further study of how to exploit this information, but it already yields a moderate performance boost on three popular document collections.
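To make the limitation of idf concrete, here is a minimal sketch of the baseline tfidf weighting that the abstract refers to. It is not the proposed probabilistic model, which the abstract does not specify. The toy corpus, the labels, and the function names `idf` and `tfidf` are assumptions made for illustration only.

```python
import math
from collections import Counter

def idf(docs):
    """Classic idf: log(N / df(t)) per term. Category labels are never consulted."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf(doc, idf_table):
    """Weight each term by its raw term frequency times its idf."""
    tf = Counter(doc)
    return {term: tf[term] * idf_table.get(term, 0.0) for term in tf}

# Hypothetical tokenized training documents with category labels.
docs = [
    ["goal", "match", "team", "news"],
    ["goal", "team", "coach"],
    ["stock", "market", "team", "news"],
    ["stock", "price", "market"],
]
labels = ["sports", "sports", "finance", "finance"]  # unused by idf()

idf_table = idf(docs)
weights = tfidf(docs[0], idf_table)

# "goal" occurs only in sports documents; "news" is split across both categories.
# Both appear in 2 of 4 documents, so idf assigns them the same weight.
print(idf_table["goal"], idf_table["news"])
```

In this toy corpus `goal` and `news` receive identical idf values even though `goal` separates the two categories and `news` does not. That is the kind of category information the abstract says idf leaves unused.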
