Published in: Algorithmica

Top-k Term-Proximity in Succinct Space

Abstract

Let $\mathcal{D}$ be a collection of $D$ string documents of $n$ characters in total, drawn from an alphabet $\Sigma$. The top-k document retrieval problem is to preprocess $\mathcal{D}$ into a data structure that, given a query $(P[1..p], k)$, can return the $k$ documents of $\mathcal{D}$ most relevant to the pattern $P$. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of $P$ in a document $T_d$. For example, it can be the term frequency (i.e., the number of occurrences of $P$ in $T_d$), the term proximity (i.e., the distance between the closest pair of occurrences of $P$ in $T_d$), or a pattern-independent importance score of $T_d$ such as PageRank. Linear-space solutions with optimal query time already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space-efficient data structures for term-proximity-based retrieval have remained elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only $o(n)$ bits on top of any compressed suffix array of $\mathcal{D}$ and solves queries in $O((p + k)\,\mathrm{polylog}\, n)$ time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance can be handled using twice the space required to represent the text collection.
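
To make the two pattern-dependent ranking functions concrete, the following is a minimal brute-force sketch in Python. It is not the succinct data structure from the paper: it scans every document for every query, and it is only meant to illustrate what a top-k term-proximity query returns. The function names (`occurrences`, `term_frequency`, `term_proximity`, `top_k_by_proximity`) are hypothetical, and two conventions are assumptions: documents with fewer than two occurrences of the pattern are treated as having no proximity score, and ties are broken by the smaller document index.

```python
import heapq


def occurrences(text, pattern):
    """Return the start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(text, pattern):
    """Number of occurrences of the pattern in the document."""
    return len(occurrences(text, pattern))


def term_proximity(text, pattern):
    """Distance between the closest pair of occurrences, or None if fewer than two.

    Positions are found in increasing order, so the closest pair is always
    a pair of consecutive occurrences.
    """
    pos = occurrences(text, pattern)
    if len(pos) < 2:
        return None  # assumption: no score without at least two occurrences
    return min(b - a for a, b in zip(pos, pos[1:]))


def top_k_by_proximity(docs, pattern, k):
    """Indices of the k documents with the smallest term proximity.

    Naive baseline: scans every document on every query.
    """
    scored = []
    for doc_id, text in enumerate(docs):
        prox = term_proximity(text, pattern)
        if prox is not None:
            scored.append((prox, doc_id))
    # Tuples sort by proximity first, then by document index (tie-break).
    return [doc_id for _, doc_id in heapq.nsmallest(k, scored)]


docs = ["abracadabra", "abab", "ab", "xyz"]
print(top_k_by_proximity(docs, "ab", 2))  # prints [1, 0]
```

In this toy collection, "abab" has occurrences of "ab" at positions 0 and 2 (proximity 2), "abracadabra" at 0 and 7 (proximity 7), and the remaining documents have fewer than two occurrences, so they are never reported. The brute-force scan is linear in the collection size per query, which is exactly the cost a preprocessed index is meant to avoid.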
