Bayesian Variable Selection for High Dimensional Data Analysis.

Abstract

In statistical modeling, an accurate predictive model is often the goal. Modern data sets usually have a large number of predictors; DNA microarray gene expression data, for example, typically have few observations and many variables. Parsimony is therefore an especially important issue. Best-subset selection is a conventional method of variable selection, but with a large number of variables, a relatively small sample size, and severe collinearity among the variables, standard statistical methods for selecting relevant variables often face difficulties.

The second part of the thesis proposes a Bayesian stochastic variable selection approach for gene selection based on a probit regression model with a generalized singular g-prior distribution for the regression coefficients. An efficient and dependable algorithm is implemented using simulation-based MCMC methods to simulate parameters from the posterior distribution. The algorithm is also shown to be robust to the choice of initial values, and it produces posterior probabilities of relevant genes for biological interpretation. The performance of the proposed approach is compared with that of other popular methods for gene selection and classification on the well-known colon cancer and leukemia data sets from the microarray literature.

The third part of the thesis proposes a Bayesian stochastic search variable selection approach for multi-class classification, which can identify relevant genes by assessing sets of genes jointly. We consider a multinomial probit model with a generalized g-prior for the regression coefficients. An efficient algorithm based on simulation-based MCMC methods is developed for simulating parameters from the posterior distribution. This algorithm is robust to the choice of initial values and produces posterior probabilities of relevant genes for biological interpretation. We demonstrate the performance of the approach on two well-known gene expression profiling data sets: the leukemia data and the lymphoma data. Compared with other classification approaches, our approach selects fewer relevant genes and achieves competitive classification accuracy.

The last part of the thesis concerns further research. It presents a stochastic variable selection approach with different two-level hierarchical prior distributions. These priors can be used as a sparsity-enforcing mechanism to perform gene selection for classification. An efficient algorithm can be developed and implemented using simulation-based MCMC methods to simulate parameters from the posterior distribution.
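The abstract does not give the sampler itself, so the following is only a minimal sketch of the general idea behind stochastic variable selection in a binary probit model, not the thesis's algorithm. It assumes the standard latent-variable (data augmentation) formulation of probit regression, a g-prior with a fixed scale `c`, and independent Bernoulli priors on the inclusion indicators. The names `ssvs_probit`, `log_marginal`, `draw_latent`, `c`, and `prior_incl` are hypothetical, and the least-squares/pseudo-inverse step is a stand-in for a generalized (singular) g-prior rather than a confirmed match to the thesis.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for truncated-normal draws


def log_marginal(z, Xg, c):
    """log p(z | gamma) up to a constant, with beta integrated out under
    the assumed g-prior beta ~ N(0, c * (Xg'Xg)^-)."""
    if Xg.shape[1] == 0:
        return -0.5 * (z @ z)
    # lstsq returns the minimum-norm solution, so rank-deficient Xg is handled.
    coef, _, rank, _ = np.linalg.lstsq(Xg, z, rcond=None)
    fit = Xg @ coef  # projection of z onto the column space of Xg
    return -0.5 * rank * np.log1p(c) - 0.5 * (z @ z - c / (1 + c) * (z @ fit))


def draw_latent(mu, y, rng):
    """Draw z_i ~ N(mu_i, 1) truncated to z_i > 0 if y_i == 1, else z_i < 0."""
    z = np.empty(len(mu))
    for i, (m, yi) in enumerate(zip(mu, y)):
        p0 = _STD.cdf(-m)  # P(z_i <= 0) under N(m, 1)
        u = rng.uniform(p0, 1.0) if yi == 1 else rng.uniform(0.0, p0)
        u = min(max(u, 1e-12), 1 - 1e-12)  # keep inv_cdf inside (0, 1)
        z[i] = m + _STD.inv_cdf(u)
    return z


def ssvs_probit(X, y, c=10.0, prior_incl=0.1, n_iter=300, burn_in=100, seed=0):
    """Return estimated posterior inclusion probabilities for each column of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = np.zeros(p, dtype=bool)      # inclusion indicators
    z = np.where(y == 1, 1.0, -1.0)      # initial latent values
    incl = np.zeros(p)
    log_prior_odds = np.log(prior_incl) - np.log1p(-prior_incl)
    shrink = c / (1 + c)
    for it in range(n_iter):
        # 1. Gibbs update of each indicator given z, with beta integrated out.
        for j in rng.permutation(p):
            g1, g0 = gamma.copy(), gamma.copy()
            g1[j], g0[j] = True, False
            log_odds = (log_marginal(z, X[:, g1], c)
                        - log_marginal(z, X[:, g0], c) + log_prior_odds)
            # Accept gamma_j = 1 with probability sigmoid(log_odds).
            gamma[j] = np.log(rng.random()) < -np.logaddexp(0.0, -log_odds)
        # 2. Draw beta for the selected columns from its Gaussian conditional.
        mu = np.zeros(n)
        if gamma.any():
            Xg = X[:, gamma]
            coef = np.linalg.lstsq(Xg, z, rcond=None)[0]
            cov = shrink * np.linalg.pinv(Xg.T @ Xg)
            beta = rng.multivariate_normal(shrink * coef, cov)
            mu = Xg @ beta
        # 3. Refresh the latent variables given beta and the class labels.
        z = draw_latent(mu, y, rng)
        if it >= burn_in:
            incl += gamma
    return incl / (n_iter - burn_in)
```

Calling `ssvs_probit(X, y)` with an `n x p` predictor matrix and a 0/1 label vector returns one estimated inclusion probability per predictor, which corresponds to the kind of per-gene posterior probability the abstract mentions. A multi-class version would need a multinomial probit likelihood, which this sketch does not cover.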

Bibliographic record

  • Author

    Yang, Aijun.

  • Author affiliation

    The Chinese University of Hong Kong (Hong Kong).

  • Degree-granting institution: The Chinese University of Hong Kong (Hong Kong).
  • Subject: Biology, Biostatistics; Statistics.
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 98 p.
  • Total pages: 98
  • Original format: PDF
  • Language: English
