Controlling for Hidden Factors in High Dimensional eQTL Studies

Abstract

Finding genetic variants that regulate gene expression now plays a central role in the analysis of mechanisms in biological systems. This will increasingly be the case as large amounts of gene expression and genetic marker data are generated by next-generation sequencing technologies. While the unprecedented scale of these data gives scientists the opportunity to answer basic questions about biological systems, the properties of these data raise analysis challenges, particularly in terms of covariate modeling. For example, expression levels of thousands of genes are usually measured in batches, and different batches may be measured under different conditions, which creates the well-known batch effect. Besides this artificially created factor, which can affect the quality of the measurement, expression data often reflect environmental regulators that change gene expression levels, such as smoking and drug use. These sources of confounding need to be addressed either before or during analysis of the data. In this thesis, I address the analysis issues raised by a particular type of confounding in high-dimensional data: hidden factor effects. Hidden factors are defined as factors that contribute to variation in a large number of measured variables but about which the data contain no direct information. It is critical to correct for hidden factors because, if ignored, they can lead to either high false positive rates or reduced power. To tackle this issue, I propose a statistical model that combines multivariate ridge regression and factor analysis to infer both the fixed effects and the hidden confounding. The method is unique in the sense that it employs the multivariate regression components to infer the associations between the response Y and the covariate X, while it maintains efficiency by sharing the data reduction property of the factor analysis model.
Compared to other models that address the same issue, this model can successfully partition the covariance structure of the hidden factors, which dramatically improves the power and accuracy of detecting the real associations between X and Y. I also used the model to address the hidden factor issue in the analysis of gene expression levels measured in the lung airway in a sample of people, in the context of a genome association study referred to as an expression Quantitative Trait Loci (eQTL) analysis. I show that the method successfully eliminates the false positives caused by spurious structures (hidden factors) and greatly improves the power to detect true genetic determinants (the eQTL) that regulate gene expression in the lung airway. I also apply the method to a challenging Genotype-Environment Interaction (GEI) analysis, where GEI effects are defined as the dependence of genotype-phenotype relationships on environmental factors. I show that, despite the small sample size and the highly complicated data structure, my method can identify a large number of interesting GEI associations, many of which involve genes that other studies have independently verified to be highly relevant to lung disease and lung function. These GEI associations contain more information than a typical eQTL because they help to identify genetic regulators that behave differently under different environmental pressures, and so provide an interesting set of candidate genes for clinical scientists.
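
The idea of pairing a ridge-penalized regression of Y on X with a low-rank term for hidden factors can be illustrated with a small simulation. The abstract does not specify the thesis's model or estimation procedure, so the sketch below is not the author's method: it swaps in a simple alternating least-squares scheme, and the function name `fit_ridge_plus_factors`, the simulated data, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def fit_ridge_plus_factors(Y, X, n_factors, ridge=1.0, n_iter=20):
    """Illustrative only (not the thesis's estimator).

    Alternately minimises
        ||Y - X B - F||_F^2 + ridge * ||B||_F^2,  rank(F) <= n_factors,
    over B (ridge step) and F (low-rank step).
    Returns (B, F, objective_history).
    """
    p = X.shape[1]
    F = np.zeros_like(Y)
    gram = X.T @ X + ridge * np.eye(p)  # ridge normal-equations matrix
    history = []
    for _ in range(n_iter):
        # Ridge step: exact minimiser over B with the low-rank part F fixed.
        B = np.linalg.solve(gram, X.T @ (Y - F))
        # Factor step: best rank-k approximation of the residual (truncated SVD).
        U, s, Vt = np.linalg.svd(Y - X @ B, full_matrices=False)
        F = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors]
        objective = np.sum((Y - X @ B - F) ** 2) + ridge * np.sum(B ** 2)
        history.append(objective)
    return B, F, history


# Simulated data (all sizes and effect values are hypothetical).
rng = np.random.default_rng(0)
n, p, q, k = 100, 5, 50, 3                 # samples, variants, genes, hidden factors
X = rng.integers(0, 3, size=(n, p)).astype(float)  # genotypes coded 0/1/2
X -= X.mean(axis=0)                        # centre each variant
B_true = np.zeros((p, q))
B_true[0, 0] = 1.0                         # two true variant-gene associations
B_true[1, 5] = -1.0
Z = rng.normal(size=(n, k))                # unobserved hidden factors
L = rng.normal(size=(k, q))                # factor loadings on each gene
Y = X @ B_true + Z @ L + 0.5 * rng.normal(size=(n, q))

B_hat, F_hat, history = fit_ridge_plus_factors(Y, X, n_factors=k)
```

Because each step exactly minimises the objective over one block of parameters, `history` is non-increasing across iterations. `B_hat` holds the estimated variant-to-gene effects and `F_hat` the fitted low-rank hidden-factor contribution to expression.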

Bibliographic record

  • Author

    Gao Chuan

  • Author affiliation
  • Year 2012
  • Total pages
  • Original format PDF
  • Language of text en_US
  • Chinese Library Classification
