首页> 外文期刊>International Journal of Foundations of Computer Science >IDPM: An Improved Degenerate Pattern Matching Algorithm for Biological Sequences
【24h】

IDPM: An Improved Degenerate Pattern Matching Algorithm for Biological Sequences

机译:IDPM:生物序列的改进的简并模式匹配算法

获取原文
获取原文并翻译 | 示例
           

摘要

Let S = s(1)s(2) ... s(n) be a string, with symbols from an alphabet. S is said to be degenerate if for some positions, say i, s(i) can contain a subset of symbols from the symbol alphabet, rather than just one symbol. Given a text string T and a pattern P, both with symbols from an alphabet Sigma, the degenerate string matching problem, is to find positions in T where P occured, such that T, P, or both are allowed to be degenerate. Though some algorithms have been proposed, their huge computational cost pose a significant challenge to their practical utilization. In this work, we propose IDPM, an improved degenerate pattern matching algorithm based on an extension of the Boyer-Moore algorithm. At the preprocessing phase, the algorithm defines an alphabet-independent compatibility rule, and computes the shift arrays using respective variants of the bad character and good suffix heuristics. At the search phase, IDPM improves the matching speed by using the compatibility rule. On average, the proposed IDPM algorithm has a linear time complexity with respect to the text size, and to the overall size of the pattern. IDPM demonstrates significance performance improvement over state-of-the-art approaches. It can be used in fast practical degenerate pattern matching with large data sizes, with important applications in flexible and scalable searching of huge biological sequences.
机译:设S = s(1)s(2)... s(n)是一个字符串,符号来自字母表。 S据说是为了一些职位来堕落,说I,S(i)可以包含来自符号字母表的符号子集,而不是一个符号。给定文本字符串T和图案P,与字母Sigma的符号均具有符号,退化字符串匹配问题是在发生P发生的位置处找到位置,使得允许T,P或两者是堕落的。虽然已经提出了一些算法,但其巨大的计算成本对其实际利用产生了重大挑战。在这项工作中,我们提出了一种基于Boyer-Moore算法的扩展的改进的退化模式匹配算法。在预处理阶段,该算法定义了独立于字母独立的兼容性规则,并使用坏字符和良好后缀启发式的各个变体计算换档阵列。在搜索阶段,IDPM通过使用兼容性规则来提高匹配速度。平均而言,所提出的IDPM算法具有关于文本大小的线性时间复杂度,以及图案的整体大小。 IDPM展示了最先进的方法的显着性能改善。它可以用于与大数据尺寸的快速实用退化模式匹配,具有柔性和可扩展搜索巨大的生物序列的重要应用。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号