BMC Genomics

Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case

Abstract

Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in chloroplast genomes is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10-30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure: two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need for this assumption by providing enough information to span the inverted repeat regions completely. Yet long reads tend to have higher error rates than short reads, and relatively little is known about the best way to combine long and short reads to obtain the most accurate chloroplast genome assemblies.

Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, including different coverages of long (Oxford Nanopore) and short (Illumina) reads, different long-read lengths, and different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly. Hybrid assemblies combining at least 20x coverage of both long reads and short reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single-copy, short single-copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp long and contains 131 genes of known function.

Our results suggest that very accurate assemblies of chloroplast genomes can be achieved by combining at least 20x coverage each of long and short reads, provided that the long reads include at least ~5x coverage from reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.
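
The coverage guideline above can be checked with simple arithmetic before an assembly is attempted. The following is a minimal sketch, not the authors' pipeline: the function names and the example inverted repeat length are assumptions made for illustration, and only the genome length (159,942 bp) and the 20x and ~5x thresholds come from the abstract. Coverage is taken here as total read bases divided by genome length.

```python
from typing import Iterable

# Chloroplast genome length of Eucalyptus pauciflora reported in the abstract.
GENOME_SIZE_BP = 159_942


def coverage(read_lengths: Iterable[int],
             genome_size: int = GENOME_SIZE_BP,
             longer_than: int = 0) -> float:
    """Mean depth contributed by reads strictly longer than `longer_than` bp."""
    return sum(n for n in read_lengths if n > longer_than) / genome_size


def meets_guideline(long_reads: Iterable[int],
                    short_reads: Iterable[int],
                    ir_length: int,
                    genome_size: int = GENOME_SIZE_BP,
                    min_cov: float = 20.0,
                    min_spanning_cov: float = 5.0) -> bool:
    """Check the abstract's rule of thumb (hypothetical helper).

    ir_length is the inverted repeat length in bp; it must be supplied
    by the user because the abstract does not state it for this species.
    """
    long_reads = list(long_reads)  # iterated twice below
    return (
        coverage(long_reads, genome_size) >= min_cov
        and coverage(short_reads, genome_size) >= min_cov
        and coverage(long_reads, genome_size, longer_than=ir_length)
        >= min_spanning_cov
    )


if __name__ == "__main__":
    # Illustrative read sets; 26,000 bp is an assumed inverted repeat length.
    long_reads = [30_000] * 200
    short_reads = [150] * 25_000
    print(meets_guideline(long_reads, short_reads, ir_length=26_000))
```

With real data, the read lengths would come from the FASTQ files. The check is deliberately conservative in one respect: a read only counts toward the spanning coverage if it is strictly longer than the inverted repeat.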
