Vocabulary Filtering for Term Weighting in Archived Question Search

Abstract

This paper proposes the notion of vocabulary filtering in a term weighting framework that consists of three filters, at the document level, the collection level, and the vocabulary level. While term frequency and document frequency, along with their variations, are the dominant term weighting factors at the document level and the collection level respectively, vocabulary-level factors are seldom considered in current models. Stopword removal can be seen as a vocabulary-level filter, but it is not well integrated into current term weighting models. In this paper, we propose a vocabulary filtering and multi-level term weighting model that integrates a point-wise divergence-based measure into the commonly used TF-IDF model. With our proposed model, the specificity of the vocabulary is captured as a new factor in term weighting, and stopwords are handled naturally within the model rather than being removed according to a separately constructed list. Experiments on searching for similar questions in a large community-based question answering archive show that: (a) our proposed multi-level term weighting model is consistently better than single-level models on the retrieval task; (b) the proposed vocabulary filter distinguishes salient terms from trivial ones well, and can be used to construct stopword lists.
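
The abstract does not give the weighting formula, so the following is only a minimal sketch of how a vocabulary-level factor might be multiplied into TF-IDF. It assumes the point-wise divergence is a point-wise KL term, p(t) * log(p(t) / q(t)), between a term's probability in the question archive (p) and in a general background corpus (q), clipped at zero. The function names, the background corpus, the clipping, the probability floor, and the smoothed IDF are all illustrative assumptions, not the authors' model.

```python
import math
from collections import Counter


def unigram_probs(docs):
    """Maximum-likelihood unigram probabilities over tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def vocab_weight(term, p_archive, p_background, floor=1e-6):
    """Vocabulary-level factor (assumed form): point-wise KL term, clipped at 0.

    Terms that are no more frequent in the archive than in the background
    corpus (typical stopwords) receive weight 0.
    """
    p = p_archive.get(term, 0.0)
    if p == 0.0:
        return 0.0
    # Floor the background probability so unseen terms do not divide by zero.
    q = max(p_background.get(term, 0.0), floor)
    return max(0.0, p * math.log(p / q))


def idf(term, docs):
    """Collection-level factor: a smoothed inverse document frequency."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1)) + 1.0


def multi_level_weight(term, doc, docs, p_archive, p_background):
    """Document-level TF times collection-level IDF times vocabulary factor."""
    tf = doc.count(term)
    return tf * idf(term, docs) * vocab_weight(term, p_archive, p_background)


# Toy data, invented purely for illustration.
archive = [
    ["how", "to", "install", "python"],
    ["how", "to", "fix", "python", "error"],
    ["what", "is", "the", "python", "gil"],
]
background = [
    ["how", "to", "the", "is", "what", "the", "to", "how"],
    ["the", "is", "to", "a", "how", "the", "python"],
]

p_archive = unigram_probs(archive)
p_background = unigram_probs(background)

for term in ["python", "how", "the"]:
    doc = archive[0] if term != "the" else archive[2]
    print(term, multi_level_weight(term, doc, archive, p_archive, p_background))
```

Under these assumptions, a term such as "how" that is at least as common in the background corpus as in the archive gets a zero vocabulary factor, so it drops out of the final weight without a separate stopword list. Ranking the vocabulary by `vocab_weight` and taking the lowest-scoring terms is one way such a filter could be used to build a stopword list.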
