首页> 外文会议>Annual international conference on research in computational molecular biology >EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices
【24h】

EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices

机译:EPR词典:一种实用且快速的数据结构,用于在单向和双向FM索引中进行恒定时间搜索

获取原文

摘要

The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows to search a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows to search for a pattern by extending an infix of the pattern arbitrarily to the left or right. If σ is the size of the alphabet then the method of Lam et al. can conduct one step in time O(σ) while needing space O(σ · n) using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to O(logσ ) while using O(log σ · n) bits of space for both, the FM and 2FM index. This is achieved by the use of binary wavelet trees. In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in 0(1) time per step while using O(log σ · n) + o(log σ · σ · n) bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary). We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition we compared our implementation with other freely available implementations of bidirectional indices and show that we are between ≈ 2.2-4.2 times faster. This will have a large impact for many bioinformatics applications that rely on practical implementations of (2)FM indices e.g. for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices.
机译:单向FM索引由Ferragina和Manzini于2000年引入,可以在一个方向上搜索索引中的模式。双向FM指数(2FM)由Lam等人介绍。在2009年。它允许通过向左或向右任意扩展模式的前缀来搜索模式。如果σ是字母的大小,则采用Lam等人的方法。可以对位向量进行恒定时间等级查询,从而在需要空间O(σ·n)的同时进行时间O(σ)的一步。 Schnattinger和他的同事这次将FM和2FM索引都使用了O(logσ·n)位空间,这次将其改进为O(logσ)。这是通过使用二进制小波树来实现的。在本文中,我们介绍了一种新的实用方法,可在使用O(logσ·n)+ o(logσ·σ·n)的情况下,以每步0(1)次的时间在单向和双向FM索引中进行精确搜索。一点空间。这是通过用新的数据结构,即增强的前缀和秩字典(EPR-dictionary)替换二进制小波树来完成的。我们在SeqAn C ++库中实现了该方法,并通过实验验证了我们的理论结果。此外,我们将我们的实现与双向索引的其他免费提供的实现进行了比较,结果表明我们的速度快了≈2.2-4.2倍。这将对许多依赖(2)FM索引的实际实现的生物信息学应用产生重大影响。用于读取映射。据我们所知,这是针对2FM索引中的搜索步骤的恒定时间方法的首次实现。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号