A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

Arghya Kusum Das; Sayan Goswami; Kisung Lee; Seung-Jong Park

Abstract

BACKGROUND: Long-read sequencing promises to overcome the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.

METHODS: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently uses the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then uses the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substituted error base.

RESULTS: ParLECH outperforms state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy.

CONCLUSION: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.
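
The widest-path idea in the METHODS paragraph can be illustrated with a small sketch. It is not the authors' implementation: ParLECH is distributed, whereas this example runs in memory on one machine, and the function names (build_de_bruijn, widest_path, spell), the toy reads, and the choice of a Dijkstra-style search are all assumptions made for illustration. The sketch counts k-mers in short reads, links consecutive k-mers into a de Bruijn graph, and finds the path between two anchor k-mers whose minimum k-mer coverage is as large as possible.

```python
import heapq
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Count k-mers in the short reads and link consecutive k-mers.

    Nodes are k-mers; an edge joins two k-mers that appear next to each
    other in a read (overlap of k-1 bases). Node weight is k-mer coverage.
    """
    coverage = Counter()
    edges = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        coverage.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
    return coverage, edges


def widest_path(coverage, edges, source, target):
    """Return (path, width) maximizing the minimum coverage along the path.

    Dijkstra-style search that always expands the node with the largest
    bottleneck coverage found so far. Returns (None, 0) if unreachable.
    """
    if source not in coverage or target not in coverage:
        return None, 0
    best = {source: coverage[source]}
    parent = {source: None}
    heap = [(-coverage[source], source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt in edges.get(node, ()):
            w = min(width, coverage[nxt])
            if w > best.get(nxt, 0):
                best[nxt] = w
                parent[nxt] = node
                heapq.heappush(heap, (-w, nxt))
    if target not in best:
        return None, 0
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path, best[target]


def spell(path):
    """Concatenate a k-mer path back into a base sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy data (hypothetical): three error-free short reads and one erroneous read.
short_reads = ["ACGTTCA"] * 3 + ["ACGATCA"]
coverage, edges = build_de_bruijn(short_reads, k=3)
path, width = widest_path(coverage, edges, "ACG", "TCA")
corrected = spell(path)
print(corrected, width)  # ACGTTCA 3
```

In this toy case the low-coverage branch through CGA, GAT, and ATC (coverage 1) is rejected in favor of the branch with bottleneck coverage 3, which is the behavior the abstract describes for replacing an indel error region between two anchors.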

Bibliographic record

Source: BMC Genomics, 2019, Issue S11, 15 pages
Authors: Arghya Kusum Das; Sayan Goswami; Kisung Lee; Seung-Jong Park
Original format: PDF
Keywords: Hybrid error correction; PacBio; Illumina; Hadoop; NoSQL
Date indexed: 2022-08-19 01:02:00

