首页> 外文期刊>Algorithms >Approximate String Matching with Compressed Indexes
【24h】

Approximate String Matching with Compressed Indexes

机译:具有压缩索引的近似字符串匹配

获取原文
       

摘要

A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest on compressed indexes, there has not been much progress on functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics. We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays. Most compressed self-indexes belong to one of these classes. We start by adapting the classical method of partitioning into exact search to self-indexes, and optimize it over a representative of either class of self-index. Then, we show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to a Lempel-Ziv index. Finally, we improve hierarchical verification, a successful technique for sequential searching, so as to extend the matches of pattern pieces to the left or right. Most compressed suffix trees/arrays support the required bidirectionality, thus enabling the implementation of the improved technique. In turn, the improved verification largely reduces the accesses to the text, which are expensive in self-indexes. We show experimentally that our algorithms are competitive and provide useful space-time tradeoffs compared to classical indexes.
机译:文本T的压缩的全文本自我索引是一种数据结构,需要减少空间并能够搜索T中的模式P。它还可以再现T的任何子字符串,从而实际上替代T。尽管最近对T感兴趣,压缩索引后,除了基本的精确搜索之外,在功能上还没有取得太大进展。在本文中,我们关注于索引近似字符串匹配(ASM),这在生物信息学中引起了极大的兴趣。我们研究了Lempel-Ziv压缩索引和压缩后缀树/数组的ASM算法。大多数压缩的自索引属于这些类之一。我们首先将对精确搜索进行划分的经典方法改编为自索引,然后针对任一类自索引的代表对其进行优化。然后,我们证明Lempel-Ziv索引可以看作是经典q样本索引的扩展。我们对这种类型的索引提供了新的见解,它们可能具有独立的意义,然后将其应用于Lempel-Ziv索引。最后,我们改进了层次验证,这是一种成功的顺序搜索技术,以便将图案片段的匹配范围扩展到左侧或右侧。大多数压缩后缀树/数组支持所需的双向性,因此可以实现改进的技术。反过来,改进的验证功能则大大减少了对文本的访问,这在自索引中非常昂贵。我们通过实验证明,与经典索引相比,我们的算法具有竞争力,并提供了有用的时空权衡。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号