String Retrieval for Multi-pattern Queries

Abstract

Given a collection D of string documents {d_1, d_2, ..., d_{|D|}} of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns {P_1, P_2, ..., P_m}. To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that our task is to report all the documents which match the query with score more than the threshold (or, respectively, the K documents with the highest scores). When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single-pattern retrieval case has been well solved by [14,9]. When it comes to two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which took Õ(n^{3/2}) space. In this paper, we give the first linear-space (and partly succinct) data structures, which can answer multi-pattern queries in O(∑ |P_i|) + Õ(t^{1/m} n^{1-1/m}) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of O(|P_1| + |P_2| + √(nt) log^2 n). We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.
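To make the query semantics concrete, the following is a minimal brute-force sketch of the problem statement only. It is not the paper's weight-balanced wavelet tree and has none of its time or space guarantees: it scans every document at query time. The function names, the choice of score (total occurrence count over all patterns), and the requirement that a reported document contain every pattern are assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def score_documents(docs, patterns):
    """Return {document index: score} for documents containing every pattern.

    Assumed score: total number of occurrences of all patterns.
    """
    scores = {}
    for i, doc in enumerate(docs):
        counts = [count_occurrences(doc, p) for p in patterns]
        if all(c > 0 for c in counts):
            scores[i] = sum(counts)
    return scores


def report_above_threshold(docs, patterns, threshold):
    """List documents whose score is strictly greater than threshold."""
    scores = score_documents(docs, patterns)
    return sorted(i for i, s in scores.items() if s > threshold)


def report_top_k(docs, patterns, k):
    """List the k highest-scoring documents, ties broken by smaller index."""
    scores = score_documents(docs, patterns)
    return heapq.nlargest(k, scores, key=lambda i: (scores[i], -i))


docs = ["abracadabra", "banana", "abcabcabc", "cab"]
patterns = ["ab", "ca"]
print(score_documents(docs, patterns))
print(report_above_threshold(docs, patterns, threshold=2))
print(report_top_k(docs, patterns, k=2))
```

Each query here costs time proportional to the total length n of the collection regardless of the output size, which is the cost that an indexed, output-sensitive structure is meant to avoid.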

Bibliographic Details

Source
String Processing and Information Retrieval | 2010 | pp. 55-66 | 12 pages
Conference location: Los Cabos (MX)
Authors
Wing-Kai Hon; Rahul Shah; Sharma V. Thankachan; Jeffrey Scott Vitter
Author affiliations

Department of CS, National Tsing Hua University, Taiwan;

Department of CS, Louisiana State University, USA;

Department of CS, Louisiana State University, USA;

Department of EECS, The University of Kansas, USA;

Original format: PDF
Language: English
CLC classification: Information processing
Date indexed: 2022-08-26 14:06:26

Similar Literature

1. Cross Media Passage Level Retrieval—Access Method to Spoken Documents by Telop and CG Flip Character Strings as Queries [J]. Seiichi Takao, Yasuo Ariki, Jun Ogata. Electronics and Communications in Japan, Part 2: Electronics. 2005, No. 10.
2. Multi-Pattern Matching for Dictionary Compressed Strings [J]. Chen Hou, Meng Zhang, Hengshan Yue. Sensor Letters: A Journal Dedicated to all Aspects of Sensors in Science, Engineering, and Medicine. 2014, No. 2.
3. Efficient bit-parallel multi-patterns approximate string matching algorithms [J]. Rajesh Prasad, Anuj Kumar Sharma, Alok Singh. Scientific Research and Essays. 2011, No. 4.
4. String Retrieval for Multi-pattern Queries [C]. Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan. International Symposium on String Processing and Information Retrieval. 2010.
5. Multi-pattern string matching algorithms [D]. Zha, Xinyan. 2010.
6. Query expansion using MeSH terms for dataset retrieval: OHSU at the bioCADDIE 2016 dataset retrieval challenge [O]. Theodore B Wright, David Ball, William Hersh. 2017.
7. Highly Compressed Multi-pattern String Matching on the Cell Broadband Engine [O]. Xinyan Zha, Daniele Paolo Scarpazza, Sartaj Sahni. 2011.
8. HDF5-Fast Query: An API for Simplifying Access to Data Storage, Retrieval, Indexing and Querying [R]. Bethel, E. W. 2006.
