
Statistical learning applied to transcriptional regulation in small N, large D domains.


Abstract

The last 15 years have witnessed an explosion of high-throughput biological data, where huge numbers of variables (dimensions) are measured for each patient or organism (sample). The most important examples include DNA and RNA sequencing and microarrays. Early hopes that access to such huge volumes of information would revolutionize the field have been moderated by the difficulties of analyzing such data. There is usually no feasible way to interpret this data manually, so statistical learning is typically used. An important limitation, though, is the "small N, large D" problem: examining a huge number of dimensions with limited sample size increases the occurrence of spurious results due to chance and provides limited ability to infer complex interactions. This thesis focuses on improving statistical learning methodology with respect to high-throughput biological data in three specific areas.

The first area is inference of phenomenological gene regulatory networks, or determining what genes will be affected by perturbing the expression of a given gene. This is done by integrating high-throughput cytosine methylation data, which has recently become available and has not been previously used, with mRNA expression data. Bayesian networks are then used to infer directed regulatory networks. The method developed is termed IDEM, for Identification of Direction from Expression and Methylation.

A related area is mechanistic gene regulatory networks, where the focus is on gene regulation due to direct interactions between transcription factor proteins and the DNA sequence near their target genes. The subproblem examined in this thesis is de novo motif discovery. It is demonstrated that commonly used generative models of "random" DNA sequence are "too null" and fail to capture important properties of "random" DNA. This motivates a discriminative approach. This approach is difficult, though, because the sample size is effectively limited to the number of coregulated genes or the number of genes to which a given transcription factor binds, whereas the number of possible binding motifs is enormous. The dimensionality can be several thousand nucleotides. It is shown that, when properly validated, discriminative approaches perform very poorly. Finally, an adjusted logistic regression, or ALR, method is developed to mitigate weaknesses identified in prior methods.

Lastly, a classifier for tumor sites of origin is created by aggregating publicly available data from over 100 studies, thereby increasing sample size to the point where robust prediction is feasible. It is demonstrated that including a large number of studies in the training data mitigates batch and study effects. The accuracy of several classification techniques, including a novel one based on decision trees of top scoring pairs (TSPs), is compared. Finally, it is shown that preserving cross-study diversity of samples is even more important than preserving sample size, and the degree to which ordinary cross-validation is overoptimistic relative to cross-study validation is quantified.

Overall, we demonstrate the importance of tailoring learning to the underlying biology, available sample size and appropriate null hypothesis.
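The "small N, large D" problem described in the abstract can be illustrated with a minimal simulation. The sketch below is not taken from the thesis: the function names, sample size, and feature counts are hypothetical choices made for illustration. It draws pure-noise features and a pure-noise outcome, then reports the largest absolute Pearson correlation found among the features, which tends to grow as the number of dimensions increases even though no real relationship exists.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def max_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a noise outcome and any of n_features noise features.

    Every value is independent standard-normal noise, so any correlation
    found here is spurious by construction (illustrative assumption).
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    # Fixed small N, increasing D: the best chance correlation keeps rising.
    for n_features in (10, 100, 1000, 5000):
        r = max_spurious_correlation(n_samples=20, n_features=n_features)
        print(f"N=20, D={n_features}: max |r| = {r:.3f}")
```

Because the same seed is reused, each larger run contains the smaller run's features as a prefix, so the reported maximum can only stay the same or increase with D. This is one way to see why results from high-dimensional screens need validation against an appropriate null.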

Record details

  • Author: Simcha, David M.
  • Author affiliation: The Johns Hopkins University
  • Degree-granting institution: The Johns Hopkins University
  • Subject: Engineering Biomedical; Biology Biostatistics; Biology Bioinformatics; Biology Genetics
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 157 p.
  • Total pages: 157
  • Original format: PDF
  • Language of text: English
