
Parallel and Cloud Computing Based Genome Assembly using Bi-directed String Graphs.


Abstract

Background: Whole-genome sequencing has proven to be one of the most powerful technologies in genetics. It has found numerous applications, from plant and microbial genomics to advanced human genomics, and it has been shown to provide the most comprehensive collection of an individual's genetic variants. Sanger sequencing dominated the industry for nearly two decades; with the advent of next-generation sequencing (NGS), whole-genome sequencing has become more efficient. However, NGS has a major limitation: its reads are shorter. This makes genome assembly from NGS data more complicated and more dependent on large computational resources. This thesis compares two assemblers designed for short NGS reads, Velvet and Contrail, both based on the newer de Bruijn graph approach. Velvet relies on large memory (RAM) to resolve the assembly graph, whereas Contrail relies on the Hadoop programming framework to distribute the assembly process in parallel over several nodes. The research compares the assembly statistics obtained after running an assembly pipeline on a given dataset. It also compares paired-read and unpaired-read sequencing for the Velvet assembler.

Results: The first phase of the analysis ran assemblies over a range of k-mer lengths (15-65) on a small dataset (2X coverage) using Velvet and Contrail. The best assembly statistics were obtained with a k-mer size of 65, and this value was kept fixed for the remaining experiments. The comparison between paired and unpaired read assembly on a small dataset using Velvet showed no significant difference. However, on a comparatively larger dataset, paired reads appeared to assemble better than unpaired reads. On a small dataset, Velvet took less time to complete than Contrail and also produced better assembly quality. When the entire dataset (192X read coverage, about 70 GB) was assembled, Velvet failed to complete the assembly. Contrail, on the other hand, took about 240 hours but ran to completion. After Velvet failed on the entire dataset, the data was divided in half and assembled again using Velvet. This time Velvet completed the process; however, Contrail showed much better assembly statistics.

Conclusion: This research supports the view that the de Bruijn graph approach is a more advanced, less complicated, and reliable way to assemble short NGS reads. It can be concluded that the k-mer size used for assembling short reads should be about 65% of the read length; at this length, the assembly quality is optimal. The size of the dataset should be taken into consideration when deciding which assembler to use. For a relatively small dataset, such as a microbial or small eukaryotic genome, Velvet would be the better option: because Velvet loads the entire de Bruijn graph into memory, assembling such genomes does not require a large-memory server. However, if the dataset is from a mammalian genome, Velvet would tend to fail unless a very large-memory server (more than 1 TB) is used. Because such servers are expensive and difficult to install, Contrail would be a better solution. Contrail runs on Hadoop, which distributes the assembly over several nodes. Installing and setting up Hadoop can also be expensive and difficult, but a Hadoop cluster can be rented from cloud computing providers. Hence, Contrail would provide a simple and cost-effective way to perform de novo assembly of short reads obtained from large genomes.
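
As an illustration of the two ideas the abstract relies on, the following Python sketch picks a k-mer size near 65% of the read length and builds a toy de Bruijn graph from short reads. It is a minimal sketch and not the code of Velvet or Contrail: the function names, the rounding down to an odd k (a common convention among de Bruijn assemblers), and the example reads are all assumptions made for illustration.

```python
from collections import defaultdict


def suggest_kmer_size(read_length, fraction=0.65):
    """Return an odd k close to fraction * read_length (hypothetical helper)."""
    k = round(read_length * fraction)
    if k % 2 == 0:
        # Assumption: prefer odd k, so that no k-mer can equal
        # its own reverse complement.
        k -= 1
    return max(k, 1)


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k over the read; each k-mer is an edge
        # from its (k-1)-length prefix to its (k-1)-length suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Illustrative read length only; the abstract does not state one.
    print(suggest_kmer_size(100))
    # Toy reads, far shorter than real NGS reads, with a small k.
    for node, successors in build_de_bruijn_graph(["ACGTAC", "GTACG"], 3).items():
        print(node, "->", successors)
```

A real assembler would additionally handle reverse complements, sequencing errors, and graph simplification; holding the whole graph in one machine's memory, as this sketch does, is the constraint the abstract attributes to Velvet on large datasets.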

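Bibliographic record

  • Author

    Kumari, Priti

  • Author affiliation

    The George Washington University

  • Degree-granting institution: The George Washington University
  • Subject: Biology, Bioinformatics
  • Degree: M.S.
  • Year: 2012
  • Pages: 44 p.
  • Total pages: 44
  • Original format: PDF
  • Language of text: English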