Venue: International Workshop on Algorithms in Bioinformatics
Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads

Abstract

Genome assembly from high-throughput short reads arguably remains an unresolvable task in repetitive genomes: when a repeat is longer than the read length, it becomes difficult to connect the flanking regions unambiguously. The emergence of third-generation sequencing (Pacific Biosciences), which produces long reads, offers the opportunity to resolve complicated repeats that short-read data could not resolve. However, these long reads have a high error rate, and assembling the genome without additional high-quality short reads is an uphill task. Recently, Koren et al. (2012) proposed using high-quality short reads to correct the long reads, thus making assembly from long reads possible. However, because both datasets (short and long reads) are large, error correction of the long reads requires excessively high computational resources, even for small bacterial genomes. In this work, instead of error-correcting the long reads, we first assemble the short reads and then map the long reads onto the assembly graph to resolve repeats.

Contribution: We present a hybrid assembly approach that is both computationally efficient and produces high-quality assemblies. Our algorithm first operates on a simplified version of the assembly graph consisting only of long contigs, and gradually improves the assembly by adding smaller contigs in each iteration. In contrast to the state-of-the-art long-read error-correction technique, which requires high computational resources and long running times on a supercomputer even for bacterial genome datasets, our software can produce comparable assemblies on a standard desktop in a short running time.
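
The idea of starting from long contigs and admitting shorter ones in later iterations can be illustrated with a toy sketch. This is not the Cerulean implementation: the function names, the input format (each long read given as the ordered list of contig names it maps to), the length thresholds, and the `min_support` cutoff are all assumptions made for illustration, and contig orientation and alignment quality are ignored.

```python
from collections import Counter


def links_at_threshold(contig_len, read_paths, min_len):
    """Count long-read support for links between contigs of length >= min_len.

    Contigs shorter than min_len are skipped, so a read that passes through
    a short repeat contig directly links the long contigs flanking it.
    """
    support = Counter()
    for path in read_paths:
        kept = [c for c in path if contig_len[c] >= min_len]
        for a, b in zip(kept, kept[1:]):
            support[(a, b)] += 1
    return support


def iterative_links(contig_len, read_paths, thresholds, min_support=2):
    """Start with long contigs only, then admit shorter contigs each round."""
    rounds = []
    for min_len in sorted(thresholds, reverse=True):
        support = links_at_threshold(contig_len, read_paths, min_len)
        confident = {edge: n for edge, n in support.items() if n >= min_support}
        rounds.append((min_len, confident))
    return rounds


# Hypothetical data: R is a short repeat occurring between A and B
# and again between C and D.
contig_len = {"A": 5000, "B": 4000, "C": 4500, "D": 6000, "R": 300}
read_paths = [["A", "R", "B"], ["A", "R", "B"], ["C", "R", "D"], ["C", "R", "D"]]

for min_len, edges in iterative_links(contig_len, read_paths, [1000, 100]):
    print(min_len, edges)
```

In this toy input, the first round (long contigs only) links A to B and C to D directly, which is the pairing that the short-read graph alone cannot decide because both paths run through R. The second round then reintroduces R as an intermediate node on each path.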
