Kernel-based nonparametric testing in high-dimensional data with applications to gene set analysis.

Abstract

The ultimate goal of genome-wide association studies (GWAS) is to understand the underlying relationship between genetic variants and phenotype. While much of the heritability is missed by the univariate analysis of traditional GWAS, it is believed that jointly analyzing variants that function interactively in a biological pathway (gene set) is more beneficial for detecting association signals. With the fast pace of development in sequencing techniques, more detailed human genome variation will be observed, and hence the dimension of the variants in a pathway could be extremely high. To model the systematic mechanism and the potential nonlinear interactions among the variants, in this dissertation we propose to model the set effect through a flexible nonparametric function under a high-dimensional setup, which allows the dimension to go to infinity as the sample size goes to infinity.

Chapter 2 considers testing a nonparametric function of high-dimensional variates in a reproducing kernel Hilbert space (RKHS), which is a function space generated by a positive definite or semidefinite kernel function. We propose a test statistic to test the nonparametric function under the high-dimensional setting. The asymptotic distributions of the test statistic are derived under the null hypothesis and under a series of local alternative hypotheses, for which explicit power formulas are also provided. We also develop a novel kernel selection procedure to maximize the power of the proposed test, as well as a kernel regularization procedure to further improve power. Extensive simulation studies and a real data analysis were conducted to evaluate the performance of the proposed method.

Chapter 3 is a theoretical investigation of the statistical optimality of the kernel-based test statistic under the high-dimensional setup, from the minimax point of view. In particular, we consider a high-dimensional linear model as an initial study. Unlike the sparsity or independence assumptions made in the related literature, we discuss the minimax properties under a structure-free setting. We characterize the boundary that separates the testable region from the non-testable region, and show the rate-optimality of the kernel-based test statistic under certain conditions on the covariance matrix and the growth rate of the dimension.

Our work in Chapter 4 fills a gap in kernel-based testing with multiple candidate kernels under the high-dimensional setting. First, we extend the test statistic proposed in Chapter 2 to a more inclusive form that allows adjustment for covariates. The asymptotic distribution of the new test statistic under the null hypothesis is then provided. Two practical and efficient strategies are developed to incorporate multiple kernel candidates into the testing procedure. Through comprehensive simulation studies we show that both strategies can calibrate the type I error rate and improve the power over a poor choice of kernel candidate in the set. In particular, the maximum method, one of the two strategies, is shown to have the potential to boost the power close to that of the best candidate kernel. An application to Thai baby birth weight data further demonstrates the merits of our proposed methods.
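As a rough illustration of the kind of kernel-based association test summarized above, the following Python sketch tests whether a phenotype `y` is associated with a high-dimensional variant set `X`. Everything in it is an assumption made for illustration: the quadratic-form statistic `Q = r^T K r` on centered responses, the linear kernel, the permutation-based null, and all function names. The abstract does not state the dissertation's actual test statistic, which is calibrated through asymptotic distributions rather than permutations.

```python
import numpy as np


def linear_kernel(X):
    """Linear kernel K = X X^T / p (one assumed choice among many kernels)."""
    return X @ X.T / X.shape[1]


def kernel_statistic(y, K):
    """Generic quadratic-form statistic Q = r^T K r on centered responses.

    This is a standard score-type form used in kernel association tests,
    not the statistic proposed in the dissertation.
    """
    r = y - y.mean()
    return float(r @ K @ r)


def permutation_pvalue(y, K, n_perm=199, seed=0):
    """Permutation p-value: shuffle y to break any link between y and X."""
    rng = np.random.default_rng(seed)
    observed = kernel_statistic(y, K)
    exceed = 0
    for _ in range(n_perm):
        if kernel_statistic(rng.permutation(y), K) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 200, 400  # more variants than samples
    X = rng.standard_normal((n, p))
    # Simulated phenotype with a dense, weak-per-variant set effect.
    y = X.sum(axis=1) / np.sqrt(p) + 0.5 * rng.standard_normal(n)
    print("p-value:", permutation_pvalue(y, linear_kernel(X)))
```

A permutation null is used here only because it needs no distributional theory; swapping `linear_kernel` for another positive semidefinite kernel leaves the rest of the sketch unchanged.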

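Bibliographic record

  • Author

    He, Tao.;

  • Author affiliation

    Michigan State University.;

  • Degree-granting institution Michigan State University.;
  • Subject Statistics.;Biostatistics.
  • Degree Ph.D.
  • Year 2015
  • Pages 115 p.
  • Total pages 115
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification
  • Keywords

  • Date added to database 2022-08-17 11:52:18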
