Techniques for de novo sequence assembly: Algorithms and experimental results.

Abstract

Deep sequencing with second-generation sequencing technology has enabled the study of complex biological systems that contain multiple DNA units simultaneously, as in transcriptomics and metagenomics. Unlike in general genome sequence assembly, a DNA unit in these systems may be present in an experimental sample as multiple copies that carry small or substantial structural variations and/or SNPs. Deep sequencing is therefore necessary to resolve such variations concurrently.

This dissertation focuses on de novo transcriptome assembly, which requires the simultaneous assembly of multiple alternatively spliced gene transcripts. In practice, de novo transcriptome assembly is the only option for studying the transcriptome of organisms that lack a reference genome sequence, and it can also be applied to identify novel transcripts and structural variations in the gene regions of model organisms. We propose WEAV for de novo transcriptome assembly; it consists of two separate processes: clustering and assembly.

In the clustering process, WEAV reduces the complexity of an RNA-seq dataset by partitioning it into clusters. It simplifies a diverse RNA-seq dataset, which contains reads from many genes, into many smaller clustered read sets, each containing only a few genes. The underlying idea is straightforward. A sequencer samples reads from random positions, so reads from one gene may overlap with one another if the sequencing depth is sufficient. These overlaps are the key to connecting reads from one gene. We can transform a dataset into a graph in which each read is a node and two reads are connected by an edge when they overlap. Each connected component becomes a clustered read set.
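The read graph described above can be illustrated with a small sketch. It is not WEAV's implementation: the function names, the exact suffix-prefix overlap test, the `min_overlap` threshold, and the quadratic pairwise comparison are all assumptions chosen for brevity, and reverse complements and sequencing errors are ignored. Connected components are found with a union-find structure.

```python
from itertools import combinations


def has_overlap(a: str, b: str, min_overlap: int) -> bool:
    """Return True if a suffix of `a` exactly matches a prefix of `b`
    with length >= min_overlap (hypothetical overlap criterion)."""
    longest = min(len(a), len(b))
    return any(a[-k:] == b[:k] for k in range(min_overlap, longest + 1))


def cluster_reads(reads: list[str], min_overlap: int = 4) -> list[list[str]]:
    """Group reads into connected components of the overlap graph."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Follow parent links to the component root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: an edge exists if either read's suffix
    # overlaps the other's prefix. Real datasets would need an index.
    for i, j in combinations(range(len(reads)), 2):
        if (has_overlap(reads[i], reads[j], min_overlap)
                or has_overlap(reads[j], reads[i], min_overlap)):
            parent[find(i)] = find(j)

    clusters: dict[int, list[str]] = {}
    for i, read in enumerate(reads):
        clusters.setdefault(find(i), []).append(read)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "ACGGTT", "TTTTCA", "TTCAGG"]
    for cluster in cluster_reads(toy_reads, min_overlap=4):
        print(cluster)
```

In this toy input, the first three reads chain together through 4-base overlaps even though the first and third do not overlap directly, which is the transitive behavior that makes connected components a natural clustering unit.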
As a result, we can assume that a cluster contains one or a few genes, so it is not a mixture of many genes.

After this process, WEAV assembles each clustered read set using a de Bruijn graph backbone, and a novel error correction process simplifies the backbone with a fast mapping tool, PerM. Roughly speaking, WEAV tries to solve the classical Shortest Common Superstring problem on the graph to identify multiple alternatively spliced gene transcripts simultaneously, and it approaches the problem as a Set Cover problem. We propose novel statistical measures to make this NP-hard problem manageable: explainability, based on the likelihood of sequences, and correctness, based on bootstrapping.

We compared WEAV with other assemblers on a variety of simulated reads, evaluating performance with widely used measures such as specificity, sensitivity, N50, and the length of the longest sequence. We then tested WEAV on an experimental dataset of 58.58 million 100 bp human brain transcriptome reads. WEAV assembled 156,494 contigs longer than 300 bp. Of these contigs, 96.3% (specificity) mapped onto RefSeq, Gencode, or the human genome sequence (hg19), and they covered >72% of the sequenced bases annotated in RefSeq and Gencode. This high sensitivity and specificity demonstrate the power of WEAV for transcriptome assembly.
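Two of the evaluation measures named above, N50 and the length of the longest sequence, can be computed directly from contig lengths. The sketch below uses the common definition of N50 (the largest length L such that contigs of length at least L account for at least half of the total assembled bases); the function names and the `min_length` filter, which mirrors the 300 bp cutoff mentioned in the abstract, are illustrative assumptions rather than the dissertation's evaluation code.

```python
def n50(lengths: list[int]) -> int:
    """Largest length L such that contigs of length >= L cover
    at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("n50 requires at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer test for running >= total / 2
            return length
    raise AssertionError("unreachable for a non-empty list")


def summarize(contigs: list[str], min_length: int = 300) -> dict[str, int]:
    """Report count, N50, and longest contig after a length filter
    (the default cutoff is an assumption echoing the 300 bp threshold)."""
    lengths = [len(c) for c in contigs if len(c) > min_length]
    if not lengths:
        return {"count": 0, "n50": 0, "longest": 0}
    return {"count": len(lengths), "n50": n50(lengths), "longest": max(lengths)}


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Specificity and sensitivity are not sketched here because they depend on aligning contigs to reference annotations, which requires an external mapping step.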

Bibliographic record

  • Author

    Cho, Sungje.;

  • Author affiliation

    University of Southern California.;

  • Degree-granting institution University of Southern California.;
  • Subject Engineering Electronics and Electrical.
  • Degree Ph.D.
  • Year 2012
  • Pages 122 p.
  • Total pages 122
  • Original format PDF
  • Language of text eng
