Journal: BMC Genomics

A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

Abstract

BACKGROUND: Long-read sequencing has shown promise for overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, processing long reads is complicated by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.

METHODS: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently uses the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then uses the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substitution error base.

RESULTS: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy.

CONCLUSION: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.
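
To make the widest-path idea from the methods concrete, the following is a minimal single-machine sketch in Python. It is not ParLECH's implementation (which the abstract describes as distributed); the function names `kmer_coverage`, `build_de_bruijn`, and `widest_path` are hypothetical, and the sketch assumes the two (k-1)-mers that flank an indel error region of a long read have already been located in the graph. It counts k-mers in the short reads, builds a de Bruijn graph whose edges carry k-mer coverage, and then uses a Dijkstra-style search to find the path whose minimum edge coverage is as large as possible.

```python
import heapq
from collections import Counter, defaultdict


def kmer_coverage(short_reads, k):
    """Count how often each k-mer occurs across the short reads."""
    counts = Counter()
    for read in short_reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(counts):
    """Nodes are (k-1)-mers; each k-mer is an edge weighted by its coverage."""
    graph = defaultdict(list)
    for kmer, cov in counts.items():
        graph[kmer[:-1]].append((kmer[1:], cov))
    return graph


def widest_path(graph, source, target):
    """Dijkstra variant that maximizes the minimum edge coverage on the path.

    Returns (spelled_sequence, min_coverage), or (None, 0) if the target
    cannot be reached from the source.
    """
    best = {source: float("inf")}   # best known bottleneck width per node
    prev = {}                       # predecessor on the best path
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for nxt, cov in graph.get(node, ()):
            cand = min(width, cov)
            if cand > best.get(nxt, 0):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (-cand, nxt))
    if target not in best:
        return None, 0
    # Walk predecessors back to the source, then spell out the sequence.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    sequence = path[0] + "".join(n[-1] for n in path[1:])
    return sequence, best[target]


# Toy data (made up for illustration): one well-supported variant
# and one variant seen only once.
reads = ["ACGTT", "ACGTT", "ACGTT", "ACATT"]
graph = build_de_bruijn(kmer_coverage(reads, k=3))
print(widest_path(graph, "AC", "TT"))  # ('ACGTT', 3)
```

In this toy input, two paths connect `AC` to `TT`: one through k-mers seen three times and one through k-mers seen once. The search returns the higher-coverage path, which is the sequence that would replace the error region under the assumptions above.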
