Journal: ACM Transactions on Information Systems

Set-based vector model: an efficient approach for correlation-based ranking

Abstract

This work presents a new approach for ranking documents in the vector space model. The novelty is twofold. First, patterns of term co-occurrence are taken into account and processed efficiently. Second, term weights are generated using a data mining technique called association rules. This leads to a new ranking mechanism called the set-based vector model. The components of our model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document. They can be obtained efficiently by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of the termset frequency in the document and its scarcity in the document collection. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2-gigabyte TREC-8 collection, the set-based vector model yields gains in average precision of 14.7% and 16.4% for disjunctive and conjunctive queries, respectively, relative to the standard vector space model. These gains increase to 24.9% and 30.0%, respectively, when proximity information is taken into account. Query processing times are longer but, on average, still comparable to those obtained with the standard vector model (increases in processing time varied from 30% to 300%). Our results suggest that the set-based vector model provides a correlation-based ranking formula that is effective with general collections and computationally practical.
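
The ranking idea described in the abstract (score a document by how often each query termset occurs in it and by how scarce that termset is in the collection) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the non-overlapping `passage_len` windows, the tf-idf-style weight `(1 + log f) * log(1 + N / df)`, and the exhaustive enumeration of query subsets, which stands in for association-rule mining (a real miner would prune infrequent termsets rather than enumerate them all). The `min_support` parameter is a hypothetical stand-in for a support threshold.

```python
import math
from itertools import combinations


def passages(tokens, passage_len):
    """Split a token list into non-overlapping passages of passage_len tokens."""
    return [set(tokens[i:i + passage_len])
            for i in range(0, len(tokens), passage_len)]


def termset_frequency(termset, doc_passages):
    """Count the passages of one document that contain every term of the termset."""
    return sum(1 for p in doc_passages if termset <= p)


def query_termsets(query_terms):
    """Enumerate every non-empty subset of the query terms (illustrative only)."""
    terms = sorted(set(query_terms))
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            yield frozenset(combo)


def rank(query_terms, docs, passage_len=5, min_support=1):
    """Return (doc_id, score) pairs sorted by decreasing score.

    docs maps a document id to its list of tokens. The weight used here is
    an assumed tf-idf-style formula, not necessarily the one in the paper.
    """
    doc_passages = {d: passages(toks, passage_len) for d, toks in docs.items()}
    n_docs = len(docs)
    scores = {d: 0.0 for d in docs}
    for termset in query_termsets(query_terms):
        freqs = {d: termset_frequency(termset, ps)
                 for d, ps in doc_passages.items()}
        df = sum(1 for f in freqs.values() if f > 0)
        if df < max(1, min_support):
            continue  # termset absent or too rare to be kept
        scarcity = math.log(1 + n_docs / df)
        for d, f in freqs.items():
            if f > 0:
                scores[d] += (1 + math.log(f)) * scarcity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy collection: only d1 has both query terms inside the same passage.
docs = {
    "d1": "information retrieval systems rank documents".split(),
    "d2": "information about weather and retrieval of luggage".split(),
    "d3": "cooking recipes".split(),
}
ranking = rank(["information", "retrieval"], docs, passage_len=3)
```

In this toy collection both `d1` and `d2` contain each query term, but only `d1` contains the two-term termset within a single passage, so `d1` receives the extra co-occurrence contribution and ranks first. This is the kind of term correlation that a standard vector space model, which scores terms independently, would not reward.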
