首页> 外文期刊>Computational Biology and Bioinformatics, IEEE/ACM Transactions on >Fast Exact Algorithms for the Closest String and Substring Problems with Application to the Planted (L,d)-Motif Model
【24h】

Fast Exact Algorithms for the Closest String and Substring Problems with Application to the Planted (L,d)-Motif Model

机译:最接近的字符串和子字符串问题的快速精确算法及其在种植(L,d)-Motif模型中的应用

获取原文
获取原文并翻译 | 示例
           

摘要

We present two parameterized algorithms for the closest string problem. The first runs in O(nL + ndcdot 17.97^d) time for DNA strings and in O(nL + ndcdot 61.86^d) time for protein strings, where n is the number of input strings, L is the length of each input string, and d is the given upper bound on the number of mismatches between the center string and each input string. The second runs in O(nL + ndcdot 13.92^d) time for DNA strings and in O(nL + ndcdot 47.21^d) time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in O((n-1)m^2(L + dcdot 17.97^dcdot m^{lfloor log_2(d+1)rfloor })) time for DNA strings and in O((n-1)m^2(L + dcdot 61.86^dcdot m^{lfloor log_2(d+1)rfloor })) time for protein strings, where n is the number of input strings, L is the length of the center substring, L - 1 + m is the maximum length of a single input string, and d is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All the algorithms significantly improve the previous bests. To verify experimentally the theoretical improvements in the time complexity, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm uses less memory too.
机译:我们为最接近的字符串问题提供了两种参数化算法。对于DNA字符串,第一次运行时间为O(nL + ndcdot 17.97 ^ d),对于蛋白质字符串,第一次运行时间为O(nL + ndcdot 61.86 ^ d),其中n是输入字符串的数量,L是每个输入字符串的长度,d是中心字符串和每个输入字符串之间不匹配数的给定上限。对于DNA串,第二个运行时间为O(nL + ndcdot 13.92 ^ d),对于蛋白质串,第二个运行时间为O(nL + ndcdot 47.21 ^ d)。然后,我们将第一个算法扩展到针对在O((n-1)m ^ 2(L + dcdot 17.97 ^ dcdot m ^ {lfloor log_2(d + 1)rfloor}))中运行的最接近子字符串问题的新参数化算法DNA字符串的时间,蛋白质字符串的时间为O((n-1)m ^ 2(L + dcdot 61.86 ^ dcdot m ^ {lfloor log_2(d + 1)rfloor}))的时间,其中n是输入字符串的数量,L是中心子字符串的长度,L-1 + m是单个输入字符串的最大长度,d是中心子字符串与每个输入字符串的至少一个子字符串之间不匹配数的给定上限。所有算法都大大提高了以前的最佳性能。为了通过实验验证时间复杂度的理论改进,我们在C中实现了我们的算法,并将所得程序应用于Pevzner和Sze在2000年提出的种植(L,d)-基序问题。我们将程序与以前的最佳精确度进行了比较解决该问题的程序,即PMSPrune(由Davila等人于2007年设计)。我们的实验数据表明,我们的程序在实际情况下以及在一些具有挑战性的情况下运行速度都更快。我们的算法也使用更少的内存。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号