Compressed suffix arrays and suffix trees with applications to text indexing and string matching

Grossi R; Vitter JS

首页> 外文期刊>SIAM Journal on Computing >Compressed suffix arrays and suffix trees with applications to text indexing and string matching

【24h】

Compressed suffix arrays and suffix trees with applications to text indexing and string matching

机译：压缩后缀数组和后缀树及其在文本索引和字符串匹配中的应用

获取原文

获取原文并翻译 | 示例

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Sigma. The text T can be represented in n lg |Sigma| bits by encoding each symbol with lg |Sigma| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require O( n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Omega(n) memory words, each of Omega(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Omega(lg| Sigma| n), which is significant when Sigma is of constant size, such as in ASCII or UNICODE. On the other hand, these indexes support fast searching, either in O(m lg |Sigma|) time or in O(m+ lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m/lg(|Sigma|) n + lg(|Sigma|)(epsilon) n) search time in the worst case, for any constant 0 < epsilon <= 1, using at most (epsilon(-1) + O(1)) n lg |Sigma| bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed su. x array for a typical 100 MB ASCII file can require 30 - 40 MB or less, while the raw su. x array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving O(occ lg(|Sigma|)(epsilon) n) time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in O( n lg |Sigma|) bits to obtain a total search bound of O(m/lg(|Sigma|) n + occ) time, which is optimal.

机译：诸如在万维网和在线数据库中发现的在线文本的激增激发了对支持快速字符串搜索的节省空间的文本索引方法的需求。我们对这种情况建模如下：考虑一个文本T，该文本T由从固定字母Sigma绘制的n个符号组成。文本T可以用n lg | Sigma |表示通过使用lg | Sigma |对每个符号进行编码位。目标是支持快速在线查询，以搜索m个符号的任何字符串模式P，T仅被完全扫描一次，即在预处理时创建索引时。文献中发表的文本索引方案在空间使用方面很贪心：在最坏的情况下，它们需要O（n lg n）个额外的空间位。例如，在标准单位成本RAM中，后缀树和后缀数组需要Omega（n）存储字，每个Omega（lg n）位。这些索引比文本本身大Omega（lg | Sigma | n）的乘数，当Sigma的大小固定时，例如ASCII或UNICODE，这是很重要的。另一方面，这些索引支持快速搜索，无论是O（m lg | Sigma |）时间还是O（m + lg n）时间，再加上用于列出occ模式出现情况的输出敏感成本O（occ）。我们提出了一个新的文本索引，该索引基于后缀数组和后缀树的压缩表示形式。在最坏的情况下，对于任何常数0

著录项

来源
《SIAM Journal on Computing》 |2005年第2期|共30页
作者
Grossi R; Vitter JS;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类应用数学;
关键词
compression; text indexing; text retrieval; compressed data structures; suffix arrays; suffix trees; string searching; pattern matching; INVERTED FILES; EFFICIENT IMPLEMENTATION; ONE ERROR; SPACE; CONSTRUCTION; TIME; RETRIEVAL; DICTIONARIES; COMPLEXITY; ALGORITHM;

机译：压缩;文本索引;文本检索;压缩的数据结构;后缀数组;后缀树;字符串搜索;模式匹配;反转文件;有效实现;一个错误;空间;构造;时间;检索;字典;复杂性;算法;

相似文献

外文文献
中文文献
专利

1. Compressed suffix arrays and suffix trees with applications to text indexing and string matching [J] . Grossi R, Vitter JS SIAM Journal on Computing . 2005,第2期

机译：压缩后缀数组和后缀树及其在文本索引和字符串匹配中的应用
2. Approximate string matching using compressed suffix arrays [J] . Trinh N. D. Huynh, Wing-Kai Hon, Tak-Wah Lam, Theoretical computer science . 2006,第1a3期

机译：使用压缩后缀数组进行近似字符串匹配
3. Improved Approximate String Matching Using Compressed Suffix Data Structures [J] . Tak-Wah Lam, Wing-Kin Sung, Swee-Seong Wong Algorithmica . 2008,第3期

机译：使用压缩后缀数据结构的改进的近似字符串匹配
4. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract) [C] . Roberto Grossi, Jeffrey Scott Vitter Annual ACM symposium on Theory of computing;ACM symposium on Theory of computing . 2000

机译：压缩后缀数组和后缀树及其在文本索引和字符串匹配中的应用（扩展摘要）
5. Suffix trees and suffix arrays in primary and secondary storage [D] . Ko, Pang 2007

机译：主存储和辅助存储中的后缀树和后缀数组
6. Using the Sadakane Compressed Suffix Tree to Solve the All-Pairs Suffix-Prefix Problem [O] . Maan Haj Rachid, Qutaibah Malluhi, Mohamed Abouelhoda -1

机译：使用Sadakane压缩后缀树来解决全对后缀前缀问题
7. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching [O] . GROSSI R, VITTER J.S. 2006

机译：压缩后缀数组和后缀树及其在文本索引和字符串匹配中的应用

Compressed suffix arrays and suffix trees with applications to text indexing and string matching

摘要

著录项

相似文献

相关主题

期刊订阅