首页> 外文会议>Workshop on Algorithm Engineering and Experiments >Improved Single-Term Top-k Document Retrieval
【24h】

Improved Single-Term Top-k Document Retrieval

机译:改进单学期Top-K文件检索

获取原文

摘要

On natural language text collections, finding the k documents most relevant to a query is generally solved with inverted indexes. On general string collections, however, more sophisticated data structures are necessary. Navarro and Nekrich [SODA 2012] showed that a linear-space index can solve such top-k queries in optimal time O(m +k), where m is the query length. Konow and Navarro [DCC 2013] implemented the scheme, managing to solve top-k queries within microseconds with an index using 3.3-4.0 bytes per character (this includes the storage of the collection itself). In this paper we introduce a new implementation using significantly less space, 2.5-3.0 bytes per character (again, including the collection), and retaining similar query times. For short queries, which are the most difficult, our new index actually outperforms the previous one, as well as all the other solutions in the literature. We also show that our index can be built on very large text collections, and that it can handle phrase queries efficiently on natural language text collections. In the latter case, it uses about the same space of the tokenized text (and replaces it), while answering phrase queries an order of magnitude faster than a positional inverted index.
机译:在自然语言文本集合上,找到与查询最相关的K文档通常用反相索引解决。但是,在常规字符串集合上,需要更复杂的数据结构。 Navarro和Nekrich [Soda 2012]表明,线性空间索引可以在最佳时间O(M + k)中解决此类顶级查询,其中M是查询长度。 Konow和Navarro [DCC 2013]实现了该方案,管理旨在在微秒内解决顶级k查询,其中索引使用每字符3.3-4.0字节(这包括集合本身的存储)。在本文中,我们介绍了一个使用显着更少的空间,每字符2.5-3.0字节的新实现(再次,包括集合),并保留类似的查询时间。对于短期查询,这是最困难的,我们的新指数实际上优于前一个,以及文献中的所有其他解决方案。我们还表明,我们的索引可以在非常大的文本集合上构建,并且它可以在自然语言文本集合上有效地处理短语查询。在后一种情况下,它使用了令牌化文本的相同空间(并替换它),同时应答短语查询比位置反转索引快于位置速度。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号