首页> 外文会议>String processing and information retrieval >String Retrieval for Multi-pattern Queries
【24h】

String Retrieval for Multi-pattern Queries

机译:多模式查询的字符串检索

获取原文
获取原文并翻译 | 示例

摘要

Given a collection D of string documents {d_1,d_2, ...,d|ρ|} of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns {P_1, P_2,..., P_m}- To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that our task is to report all the documents which match the query with score more than threshold (or respectively, the K documents with the highest scores). When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single pattern retrieval case has been well-solved by [14,9]. When it comes to two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which took Q(n~(3/2)) space. In this paper, we give the first linear space (and partly succinct) data structures, which can answer multi-pattern queries in O(∑ |P_i|) + Q(t~(1/m)n~(1-1/m)) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of O(|P_i| + |P_2| + 1t log~2 n). We also show space-time trade-offs for our data structures; Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.
机译:给定总长度为n的字符串文档{d_1,d_2,...,d |ρ|}的集合D可以进行预处理,则基本任务是为给定查询检索最相关的文档。该查询由一组m个模式{P_1,P_2,...,P_m}组成-为了衡量文档与查询模式的相关性,我们可以定义一个分数,例如这些模式的出现次数在文档中,或者文档中给定样式的接近度。为了控制输出的大小,我们还可以指定一个阈值(或参数K),以便我们的任务是报告与查询匹配的所有文档,其得分均大于阈值(或分别是得分最高的K个文档)。分数)。当文档是字符串(没有单词边界)时,传统的基于索引的解决方案可能不适用。 [14,9]已经很好地解决了单一模式检索的情况。当涉及到两个或更多模式时,[14]给出了唯一的用于邻近搜索和公共文档列表的非平凡解,它占用了Q(n〜(3/2))空间。在本文中,我们给出了第一个线性空间(部分简洁)的数据结构,该结构可以回答O(∑ | P_i |)+ Q(t〜(1 / m)n〜(1-1 / m))时间,其中t是输出出现的次数。在两种模式的特定情况下,我们达到O(| P_i | + | P_2 | + 1 / nt log〜2 n)的边界。我们还显示了数据结构的时空权衡;我们的方法基于一种称为权重平衡小波树的新颖数据结构,该结构可能会引起人们的关注。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号