Principal component analysis in high dimensional data: Application for genomewide association studies.

Abstract

In genomewide association studies (GWAS), population stratification (PS) is a major confounding factor that causes spurious associations by inflating test statistics. PS refers to differences in allele frequencies by disease status due to systematic differences in ancestry, rather than causal association of genes with disease. Principal component analysis (PCA) is commonly used to infer population structure by computing PC scores, which are subsequently used to control for population stratification.

Even though PCA is now widely used for PS adjustment, effective PCA-based PS control still faces challenges. One common feature of genomic data is the strong local correlation among adjacent loci/markers caused by linkage disequilibrium (LD). It is known that this local correlation can have a negative effect on estimated PC scores and produce spurious PCs that do not truly reflect the underlying population structure. To address this problem, we have employed a shrinkage PCA approach in which coefficients are used to down-weight the contribution of highly correlated SNPs in PCA.

Another challenge in PC analysis is choosing which PCs to include as covariates to adjust for population stratification. While searching for a reasonable measure for PC selection, we have found the precise relationship between genotype principal components and inflation of association test statistics. Based on this fact, we propose a new approach, called EigenCorr, which selects principal components based on both their eigenvalues and their correlation with the (disease) phenotype. Our approach tends to select fewer principal components for stratification control than does testing of eigenvalues alone, providing substantial computational savings and improvements in power.

Under many circumstances, it is of interest to predict PC scores. Although PC score prediction is commonly used in practice, characteristics of the predicted PC scores have not been systematically studied. Under high dimensional settings we have found that the naive predicted PC scores are systematically biased toward 0, and this phenomenon is largely due to the inconsistency of the sample eigenvalues and eigenvectors. We have extended existing convergence results of sample eigenvalues and eigenvectors and derived asymptotic shrinkage factors. Based on these asymptotic results, we propose the bias-adjusted PC score prediction.
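
The shrinkage of naively predicted PC scores can be illustrated with a small simulation. The sketch below is not the thesis's derivation or its bias adjustment; it assumes a simplified Gaussian model with a single high-variance direction, and the function names (`simulate`, `top_pc_direction`, `score_sd_ratio`) and all parameter values are hypothetical choices made for illustration. It estimates the leading PC direction from a training sample with far more features than samples, projects new samples onto it, and compares the spread of the predicted scores with that of the training scores.

```python
import numpy as np


def simulate(n, p, spike, rng):
    """Draw n Gaussian samples in p dimensions.

    Assumed toy model: the first coordinate has variance `spike`,
    every other coordinate has variance 1.
    """
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(spike)
    return X


def top_pc_direction(X_train):
    """Return the training mean and the leading sample PC direction."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the PC directions of the centered training data.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[0]


def score_sd_ratio(n_train=200, n_test=500, p=5000, spike=100.0, seed=0):
    """SD of naively predicted scores divided by SD of training scores."""
    rng = np.random.default_rng(seed)
    X_train = simulate(n_train, p, spike, rng)
    X_test = simulate(n_test, p, spike, rng)

    mean, direction = top_pc_direction(X_train)
    train_scores = (X_train - mean) @ direction
    # Naive prediction: project new samples onto the training PC direction.
    test_scores = (X_test - mean) @ direction
    return test_scores.std() / train_scores.std()


ratio = score_sd_ratio()
print(f"SD(predicted scores) / SD(training scores) = {ratio:.2f}")
```

Under these assumptions the ratio comes out below 1: the predicted scores for new samples are pulled toward 0 relative to the training scores, which is the qualitative behavior the abstract describes. Reducing `p` relative to `n_train` should move the ratio closer to 1.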

Bibliographic record

  • Author: Lee, Seunggeun
  • Author affiliation: The University of North Carolina at Chapel Hill
  • Degree-granting institution: The University of North Carolina at Chapel Hill
  • Subject: Biology Biostatistics; Statistics; Biology Bioinformatics
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 133 p.
  • Total pages: 133
  • Original format: PDF
  • Language of text: English
