【24h】

A Fast Heuristic Search Algorithm for Finding the Longest Common Subsequence of Multiple Strings

机译:查找多个字符串最长公共子序列的快速启发式搜索算法

获取原文

摘要

Finding the longest common subsequence (LCS) of multiple strings is an NP-hard problem, with many applications in the areas of bioinformatics and computational genomics. Although significant efforts have been made to address the problem and its special cases, the increasing complexity and size of biological data require more efficient methods applicable to an arbitrary number of strings. In this paper, a novel search algorithm, MLCS-A*, is presented for the general case of multiple LCS (or MLCS) problems. MLCS-A* is a variant of the A* algorithm. It maximizes a new heuristic estimate of the LCS in each search step so that the longest common subsequence can be found. As a natural extension of MLCS-A*, a fast algorithm, MLCS-APP, is also proposed to deal with large volume of biological data for which finding a LCS within reasonable time is impossible. The benchmark test shows that MLCS-APP is able to extract common subsequences close to the optimal ones and that MLCS-APP significantly outperforms existing heuristic approaches. When applied to 8 protein domain families, MLCS-APP produced more accurate results than existing multiple sequence alignment methods.
机译:找到多个字符串的最长公共子序列(LCS)是一个NP难题,在生物信息学和计算基因组学领域有许多应用。尽管已为解决该问题及其特殊情况做出了巨大努力,但是生物数据的复杂性和规模不断增长,需要适用于任意数量字符串的更有效方法。在本文中,针对多LCS(或MLCS)问题的一般情况,提出了一种新颖的搜索算法MLCS-A *。 MLCS-A *是A *算法的一种变体。它在每个搜索步骤中最大化了对LCS的新启发式估计,从而可以找到最长的公共子序列。作为MLCS-A *的自然扩展,还提出了一种快速算法MLCS-APP,以处理无法在合理时间内找到LCS的大量生物数据。基准测试表明,MLCS-APP能够提取接近最佳子序列的公共子序列,并且MLCS-APP明显优于现有的启发式方法。当应用于8个蛋白质结构域家族时,MLCS-APP比现有的多序列比对方法产生的结果更准确。

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号