Journal: BMC Bioinformatics

Simultaneous identification of long similar substrings in large sets of sequences


Abstract

Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered.

Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a given maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense, but they are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1 .

Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments that are controlled only by tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
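
The two ideas named in the abstract, exact seeds of a minimum length and extension under a per-window error limit, can be illustrated with a small Python sketch. This is not ClustDB's implementation: the abstract does not describe its data structures, so a plain hash index of k-mers stands in for seed generation, and the extension is simplified to substitutions only (no gaps). The function names `exact_seeds` and `extend_right` and their parameters are hypothetical.

```python
from collections import defaultdict, deque


def exact_seeds(seqs, k):
    """Return every k-mer that occurs at least twice, with its (sequence id, position) hits.

    Assumption: a hash index is used here for brevity; it is a stand-in,
    not the seed-finding method of ClustDB.
    """
    index = defaultdict(list)
    for sid, s in enumerate(seqs):
        for pos in range(len(s) - k + 1):
            index[s[pos:pos + k]].append((sid, pos))
    return {kmer: hits for kmer, hits in index.items() if len(hits) > 1}


def extend_right(a, b, i, j, window, max_errors):
    """Extend an ungapped match starting at a[i], b[j] to the right.

    Extension stops before the first column that would put more than
    max_errors mismatches into any run of `window` consecutive columns.
    The returned length is trimmed so the alignment ends on a match.
    Assumption: only substitutions are counted as errors (no gaps).
    """
    recent = deque()   # mismatch flags of the most recent columns
    errors = 0         # number of mismatches currently in `recent`
    length = 0         # columns accepted so far
    last_match = 0     # length of the longest prefix ending on a match
    while i + length < len(a) and j + length < len(b):
        mismatch = a[i + length] != b[j + length]
        if len(recent) == window:
            errors -= recent.popleft()  # oldest column leaves the window
        if errors + mismatch > max_errors:
            break
        recent.append(mismatch)
        errors += mismatch
        length += 1
        if not mismatch:
            last_match = length
    return last_match


if __name__ == "__main__":
    seqs = ["ACGTAC", "TTACGT"]
    for kmer, hits in exact_seeds(seqs, 4).items():
        print(kmer, hits)
    print(extend_right("AAAAAAAAAA", "AAAATAAATT", 0, 0, window=4, max_errors=1))
```

Because the error limit applies to every window rather than to the alignment as a whole, a dense cluster of mismatches ends the extension even when the overall error rate is low. That is the property the conclusion relies on for spotting contaminating inserts.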
