Journal: Information and Computation

Lempel-Ziv compressed structures for document retrieval



Abstract

Document retrieval structures index a collection of string documents in order to retrieve those that are relevant to a query string p: document listing retrieves all documents where p appears; top-k retrieval retrieves the k most relevant of those. Classical structures use too much space in practice. Most current research uses compressed suffix arrays, but fast indices still use 17-21 bpc (bits per character), whereas small ones take milliseconds per returned answer. We present the first document retrieval structures based on Lempel-Ziv compression, specifically LZ78. Our structures use 7-10 bpc and dominate a large part of the space/time tradeoffs. They also enable more efficient partial or approximate answers: our document listing outputs the first 75%-80% of the answers at a rate of one per microsecond; for top-k retrieval we return a result of 90% quality at the same rate while using just 4-6 bpc. This outperforms current indices by a wide margin. © 2019 Elsevier Inc. All rights reserved.
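To make the terms in the abstract concrete, here is a minimal Python sketch of an LZ78 parse, together with brute-force versions of document listing and top-k retrieval. It is not the paper's index: the function names are hypothetical, the queries scan every document instead of using a compressed structure, and relevance is assumed to be term frequency (the number of occurrences of p in a document), which the abstract does not specify.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Each phrase is encoded as (parent_id, char): the longest previously
    seen phrase (id 0 is the empty phrase) extended by one new character.
    """
    trie = {}      # (parent_id, char) -> phrase id
    phrases = []   # phrase i (1-based) is phrases[i - 1]
    node = 0
    for ch in text:
        key = (node, ch)
        if key in trie:
            node = trie[key]          # keep extending the current match
        else:
            phrases.append(key)       # new phrase = matched phrase + ch
            trie[key] = len(phrases)
            node = 0
    if node != 0:
        # The text ended inside an existing phrase; emit it with no new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (parent_id, char) pairs."""
    decoded = []
    for parent, ch in phrases:
        prefix = decoded[parent - 1] if parent else ""
        decoded.append(prefix + ch)
    return "".join(decoded)


def count_occurrences(doc, p):
    """Count (possibly overlapping) occurrences of p in doc."""
    count, start = 0, doc.find(p)
    while start != -1:
        count += 1
        start = doc.find(p, start + 1)
    return count


def document_listing(docs, p):
    """Return the indices of all documents in which p appears."""
    return [i for i, doc in enumerate(docs) if p in doc]


def top_k(docs, p, k):
    """Return the k documents with the most occurrences of p.

    Assumption: relevance is term frequency; ties go to the lower index.
    """
    scored = [(count_occurrences(doc, p), i) for i, doc in enumerate(docs)]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: (-s[0], s[1]))
    return [i for _, i in ranked[:k]]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabbage", "bandana"]
    print(lz78_parse("abababa"))
    print(document_listing(docs, "ab"))
    print(top_k(docs, "an", 2))
```

The scans above take time proportional to the total collection size per query, which is the cost an index is meant to avoid; the sketch only fixes what the two query types return.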

