首页> 外文会议>Workshop on Algorithm Engineering and Experiments >CSA++: Fast Pattern Search for Large Alphabets
【24h】

CSA++: Fast Pattern Search for Large Alphabets

机译:CSA ++:快速模式搜索大字母表

获取原文

摘要

Indexed pattern search in text has been studied for many decades. For small alphabets, the FM-Index provides unmatched performance for Count operations, in terms of both space required and search speed. For large alphabets - for example, when the tokens are words - the situation is more complex, and FM-Index representations are compact, but potentially slow. In this paper we apply recent innovations from the field of inverted indexing and document retrieval to compressed pattern search, including for alphabets into the millions. Commencing with the practical compressed suffix array structure developed by Sadakane, we show that the Elias-Fano code-based approach to document indexing can be adapted to provide new trade-off options in indexed pattern search, and offers significantly faster pattern processing compared to previous implementations, as well as reduced space requirements. We report a detailed experimental evaluation that demonstrates the relative advantages of the new approach, using the standard Pizza & Chili methodology and files, as well as applied use-cases derived from large-scale data compression, and from natural language processing. For large alphabets, the new structure gives rise to space requirements that are close to those of the most highly-compressed FM-Index variants, in conjunction with unparalleled Count throughput rates.
机译:几十年来研究了文本中的索引模式搜索。对于小字母,FM-Index在需要两个所需空间和搜索速度方面提供无与伦比的计数操作性能。对于大字母 - 例如,当令牌是单词 - 情况更复杂,并且FM-Index表示紧凑,但可能慢。在本文中,我们将最近的创新从倒置索引和文件检索到压缩模式搜索,包括字母表进入数百万。与Sadakane开发的实用压缩后缀阵列结构开始,我们表明基于ELIAS-Fano代码的文档索引方法可以适用于在索引模式搜索中提供新的折衷选项,并与上一个相比提供更快的模式处理。实现,以及减少空间要求。我们报告了一个详细的实验评估,展示了使用标准披萨和辣椒方法和文件的新方法的相对优势,以及从大规模数据压缩以及自然语言处理的应用程序使用情况。对于大字母,新结构导致与最高压缩的FM-INDEX VARIANTS接近的空间要求,与无与伦比的计数吞吐率相结合。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号