Algorithms for Determining Differentially Expressed Genes and Chromosome Structures From High-Throughput Sequencing Data.

Abstract

Next-generation sequencing (NGS) technologies can sequence DNA or RNA molecules at unprecedented speed and with high accuracy. Recently, NGS technologies have been applied in a variety of contexts, e.g., whole genome sequencing, transcript expression profiling, chromatin immunoprecipitation sequencing, and small RNA sequencing, to accelerate genomic research. NGS datasets are usually so large that data analysis in these applications relies heavily on efficient computational methods. Given the critical demand for high-performance computational algorithms, my research over the past few years has focused on designing novel algorithms to address challenges in NGS data analysis. This dissertation presents algorithmic solutions to three crucial problems in NGS data analysis: two arising from differential expression analysis using high-throughput mRNA sequencing (RNA-Seq) and one from chromosome structure capture using high-throughput DNA sequencing (Hi-C).

(1) In differential expression analysis of RNA-Seq data, long or highly expressed genes are more likely to be detected by most existing computational methods. This bias against short or lowly expressed genes may distort downstream data analysis at the systems biology level. To improve sensitivity to short or lowly expressed genes, we designed a new computational tool, called MRFSeq, that combines gene coexpression and RNA-Seq data. The performance of MRFSeq was carefully assessed using simulated and real benchmark datasets. The experimental results showed that MRFSeq called differentially expressed genes more accurately than the other existing methods, significantly alleviating the distortion caused by the bias against short and lowly expressed genes.

(2) Most existing differential expression analysis tools are developed for comparing RNA-Seq samples between known biological conditions. However, differential expression analysis is also important in other biological research where the conditions of the samples are not known a priori. For example, differentially expressed transcripts can be used as biomarkers to classify a cohort of cancer samples into subtypes, so that better diagnosis and therapy methods can be developed for each subtype. To this end, we proposed SDEAP, the first computational method for identifying differentially expressed genes and their alternative splicing events without requiring predefined conditions. SDEAP provided accurate predictions in our experiments on simulated and real datasets. The utility of SDEAP was further demonstrated by classifying subtypes of breast cancer, cell types, and cell cycle phases of mouse cells.

(3) Chromosome structures in the nucleus play important roles in the biological processes of cells. The Hi-C technology allows biology researchers to reconstruct the three-dimensional structures of chromosomes in the nucleus on a genome-wide scale and thus serves as a vital component in studies of chromosome structures. During the experimental steps of Hi-C, systematic biases may be introduced into Hi-C data. Hence, eliminating the systematic biases is essential to all applications that use Hi-C data. We developed an improved bias reduction algorithm, called GDNorm. GDNorm takes advantage of a Poisson regression model that explicitly formulates the causal relationship among Hi-C data, systematic biases, and spatial distances in chromosome structures. Our experimental results showed that GDNorm was able to remove the biases from Hi-C data, so that the corrected Hi-C data could lead to accurate reconstruction of chromosome structures.

In the near future, with the rapid accumulation of NGS data, we expect these efficient computational methods to become valuable tools for discovering novel biological knowledge and to benefit a wide range of genomic research.

Bibliographic record

  • Author

    Yang, Yi-Wen

  • Author affiliation

    University of California, Riverside

  • Degree-granting institution: University of California, Riverside
  • Subjects: Computer science; Genetics; Bioinformatics
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 148
  • Original format: PDF
  • Language of text: English
  • Date indexed: 2022-08-17 11:52:42
