Main Results. The basic string matching problem is to determine the occurrences of a short pattern P = p_1p_2 ... p_m in a large text T = t_1t_2 ... t_n, over an alphabet of size a. Indexes are structures built on the text to speed up searches, but they used to take up much space. In recent years, succinct text indexes have appeared. A prominent example is the FM-index, which takes little space (close to that of the compressed text) and replaces the text, as it can search the text (in optimal O(m) time) and reproduce any text substring without accessing it. The main problem of the FM-index is that its space usage depends exponentially on σ, that is, 5H_kn + σ~σ o(n) for any k, H_k being the k-th order entropy of T. In this paper we present a simple variant of the FM-index, which removes its alphabet dependence. We achieve this by, essentially (but not exactly), Huffman-compressing the text and FM-indexing the binary sequence. Our index needs 2n(H_0 + 1)(1 + o(1)) bits, independent of σ, and it searches in O(m(H_0 + 1)) average time, which can be made O(m log σ) in the worst case. Moreover, our index is considerably simpler to implement than most other succinct indexes.
展开▼