Simultaneous identification of long similar substrings in large sets of sequences

Jürgen Kleffe; Friedrich Möller; Burghardt Wittig

Abstract

Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered.

Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length in which the number of errors within each overlapping window of a given size does not exceed a specified maximum. Such alignments are not optimal in the usual sense, but they are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1 .

Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contamination. Such inserts are systematically overlooked in long alignments that are controlled only by tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
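The seed-and-extend idea described in the abstract can be illustrated with a small sketch. This is not ClustDB's implementation: the function names `exact_seeds` and `extend_right` are hypothetical, the seed search uses a plain k-mer dictionary rather than the paper's data structure, and the extension counts mismatches only (no gaps) and runs in one direction. It only shows what "at most a given number of errors within each overlapping window" could mean in code.

```python
from collections import defaultdict, deque


def exact_seeds(a, b, k):
    """Return (i, j) pairs with a[i:i+k] == b[j:j+k] (exact seeds of length k).

    Assumption: a simple k-mer dictionary stands in for the real index.
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            seeds.append((i, j))
    return seeds


def extend_right(a, b, i, j, window, max_errors):
    """Extend an ungapped match to the right from a[i], b[j].

    The extension continues while every run of `window` consecutive
    columns contains at most `max_errors` mismatches. Trailing
    mismatches are trimmed. Returns the number of columns accepted.
    """
    recent = deque()   # mismatch flags (0/1) of the last `window` columns
    errors = 0         # number of mismatches currently inside the window
    length = 0         # columns accepted so far
    last_match = 0     # length up to and including the last matching column
    while i + length < len(a) and j + length < len(b):
        mismatch = int(a[i + length] != b[j + length])
        if len(recent) == window:
            errors -= recent.popleft()   # oldest column leaves the window
        if errors + mismatch > max_errors:
            break                        # window would exceed the error limit
        recent.append(mismatch)
        errors += mismatch
        length += 1
        if not mismatch:
            last_match = length
    return last_match
```

In this sketch a single isolated mismatch is absorbed, while a cluster of mismatches stops the extension. That mirrors the abstract's point that a window criterion separates random errors from locally concentrated ones.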

Bibliographic details

Source: BMC Bioinformatics, 2007, issue 5
Authors: Jürgen Kleffe; Friedrich Möller; Burghardt Wittig
Original format: PDF
