Journal: BMC Bioinformatics

Clustering exact matches of pairwise sequence alignments by weighted linear regression

Abstract

Background: At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when it is available. The interest is not to analyze a detailed alignment between a contig and the reference genome at the base level, but rather to have a rough estimate of where the contig aligns to the reference genome, specifically, by identifying the starting and ending positions of such a region. This information is very useful in ordering the contigs, facilitating post-assembly analysis such as gap closure and resolving repeats. There exist programs, such as BLAST and MUMmer, that can quickly align and identify high similarity segments between two sequences, which, when seen in a dot plot, tend to agglomerate along a diagonal but can also be disrupted by gaps or shifted away from the main diagonal due to mismatches between the contig and the reference. It is a tedious and practically impossible task to visually inspect the dot plot to identify the regions covered by a large number of contigs from sequence assembly projects. A forced global alignment between a contig and the reference is not only time consuming but often meaningless.

Results: We have developed an algorithm that uses the coordinates of all the exact matches or high similarity local alignments, clusters them with respect to the main diagonal in the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest.

Conclusion: This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed extension phase with a weighted linear regression for the alignment seeds. It was experimentally shown that the gain in execution time can be outstanding without compromising the accuracy. This method should be of great utility to sequence assembly and genome comparison projects.
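
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each exact match is a hypothetical `(ref_start, contig_start, length)` triple, uses the match length as the regression weight, and trims the match that deviates most from the fitted line until every remaining match lies within a hypothetical `tolerance`. The weighting scheme, the trimming rule, and all names are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed representation of one exact match: (ref_start, contig_start, length).
Match = Tuple[int, int, int]


def weighted_fit(matches: List[Match]) -> Tuple[float, float]:
    """Weighted least-squares fit of ref = intercept + slope * contig.

    Each match is weighted by its length (an assumption of this sketch).
    """
    w_sum = sum(length for _, _, length in matches)
    x_mean = sum(length * c for _, c, length in matches) / w_sum
    y_mean = sum(length * r for r, _, length in matches) / w_sum
    sxx = sum(length * (c - x_mean) ** 2 for _, c, length in matches)
    sxy = sum(length * (c - x_mean) * (r - y_mean) for r, c, length in matches)
    # If all matches share one contig position, fall back to the diagonal slope.
    slope = sxy / sxx if sxx > 0 else 1.0
    return y_mean - slope * x_mean, slope


def cluster_region(matches: List[Match], tolerance: float = 50.0):
    """Trim off-line matches, then report the covered region.

    Returns ((ref_start, ref_end), (contig_start, contig_end), kept_matches).
    """
    kept = list(matches)
    while len(kept) > 2:
        intercept, slope = weighted_fit(kept)
        residuals = [abs(r - (intercept + slope * c)) for r, c, _ in kept]
        worst = max(range(len(kept)), key=residuals.__getitem__)
        if residuals[worst] <= tolerance:
            break  # every remaining match lies close to the fitted line
        kept.pop(worst)  # drop the match farthest from the line and refit
    ref_region = (min(r for r, _, _ in kept), max(r + n for r, _, n in kept))
    contig_region = (min(c for _, c, _ in kept), max(c + n for _, c, n in kept))
    return ref_region, contig_region, kept


if __name__ == "__main__":
    # Four matches on the diagonal ref = contig + 1000, plus one stray match.
    example = [(1000, 0, 100), (1150, 150, 200), (1400, 400, 300),
               (1750, 750, 250), (5000, 300, 20)]
    print(cluster_region(example)[:2])
```

Removing one match per round is slower than discarding every match beyond the tolerance at once, but a single distant match can tilt the first fit enough that true matches fall outside a fixed tolerance, so trimming the worst match and refitting is the more cautious choice in a sketch like this.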