Bayesian Computation for High-Dimensional Continuous & Sparse Count Data

Abstract

Probabilistic modeling of multidimensional data is a common problem in practice. When the data are continuous, one common approach is to suppose that the observed data lie close to a lower-dimensional smooth manifold. A rich variety of manifold learning methods is available for mapping data points to the manifold. However, there is a clear lack of probabilistic methods that learn the manifold along with the generative distribution of the observed data. The best attempt is the Gaussian process latent variable model (GP-LVM), but identifiability issues lead to poor performance. We solve these issues by proposing a novel Coulomb repulsive process (Corp) for the locations of points on the manifold, inspired by physical models of electrostatic interactions among particles. Combining this process with a GP prior for the mapping function yields a novel electrostatic GP (electroGP) process.

Another popular approach is to suppose that the observed data are close to one or a union of lower-dimensional linear subspaces. However, popular methods such as probabilistic principal component analysis scale poorly computationally. We introduce a novel empirical Bayesian method that we term geometric density estimation (GEODE), which assumes the data are centered near a low-dimensional linear subspace. We show that, with mild assumptions on the prior, the subspace spanned by the principal axes of the data is the posterior mode. Hence, by leveraging the geometric information in the data, GEODE easily scales to massive-dimensional problems. It is also capable of learning the intrinsic dimension via a novel shrinkage prior. Finally, we mix GEODE across a dyadic clustering tree to account for nonlinear cases.

When the data are discrete, a common strategy is to define a generalized linear model (GLM) for each variable, with dependence among the variables induced by including multivariate latent variables in the GLMs. Bayesian inference for these models usually relies on data-augmented Markov chain Monte Carlo (DA-MCMC), which has a provably slow mixing rate when the data are imbalanced. For more scalable inference, we propose Bayesian mosaic, a parallelizable composite posterior, for a subclass of multivariate discrete data models. Sampling is embarrassingly parallel because Bayesian mosaic is a product of component posteriors that can be sampled independently. Analogous to composite likelihood methods, these component posteriors are based on univariate or bivariate marginal densities. Using the fact that the score functions of these densities are unbiased, we show that Bayesian mosaic is consistent and asymptotically normal under mild conditions. Because the univariate or bivariate marginal densities can be evaluated via numerical integration, sampling from Bayesian mosaic completely bypasses traditional DA-MCMC. Moreover, we show that sampling from Bayesian mosaic scales better to large sample sizes than DA-MCMC. The performance of the proposed methods and models is demonstrated via both simulation studies and real-world applications.
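
The idea of evaluating a univariate marginal density by numerical integration and then sampling each component posterior independently can be illustrated with a small sketch. This is not the thesis's implementation. It assumes a Poisson log-normal model for each count variable (count given latent x is Poisson with rate exp(x), and x is normal with mean mu and a known standard deviation sigma), a flat prior over a grid of mu values, and Gauss-Hermite quadrature for the integral. The function names `poisson_lognormal_pmf`, `component_posterior`, and `sample_components` are hypothetical, and the bivariate components that would carry the dependence parameters are left out.

```python
import math
import numpy as np


def poisson_lognormal_pmf(y, mu, sigma, n_nodes=40):
    """Marginal P(y) = integral of Poisson(y | exp(x)) * Normal(x | mu, sigma^2) dx.

    Evaluated with Gauss-Hermite quadrature, so no latent x is ever sampled.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Substitute x = mu + sqrt(2) * sigma * t to match the exp(-t^2) weight.
    x = mu + math.sqrt(2.0) * sigma * nodes
    log_pois = y * x - np.exp(x) - math.lgamma(y + 1)
    return float(np.sum(weights * np.exp(log_pois)) / math.sqrt(math.pi))


def component_posterior(counts, mu_grid, sigma):
    """Normalized grid posterior over mu for ONE variable (assumed flat prior)."""
    values, freqs = np.unique(counts, return_counts=True)
    log_post = np.array([
        sum(f * math.log(poisson_lognormal_pmf(int(v), mu, sigma))
            for v, f in zip(values, freqs))
        for mu in mu_grid
    ])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


def sample_components(Y, mu_grid, sigma, n_draws, rng):
    """Draw from each variable's component posterior independently.

    Each column is handled on its own, so the loop could be farmed out
    to separate workers without any communication between them.
    """
    draws = []
    for j in range(Y.shape[1]):
        post = component_posterior(Y[:, j], mu_grid, sigma)
        draws.append(rng.choice(mu_grid, size=n_draws, p=post))
    return np.column_stack(draws)


if __name__ == "__main__":
    # Simulated data under the assumed model; all values are illustrative.
    rng = np.random.default_rng(0)
    true_mu = np.array([0.5, 1.0, 1.5])
    latent = true_mu + 0.5 * rng.standard_normal((500, 3))
    Y = rng.poisson(np.exp(latent))
    mu_grid = np.linspace(0.0, 2.0, 41)
    draws = sample_components(Y, mu_grid, sigma=0.5, n_draws=1000, rng=rng)
    print(draws.mean(axis=0))
```

Because the latent variable is integrated out numerically, the sketch never alternates between latent draws and parameter draws, which is the step that makes data-augmented samplers mix slowly on imbalanced counts. The grid posterior is a simplification chosen for brevity; any sampler that only needs the marginal likelihood could take its place.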

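Bibliographic record

  • Author

    Wang, Ye.

  • Author affiliation

    Duke University.

  • Degree-granting institution: Duke University
  • Subject: Statistics
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 122 p.
  • Total pages: 122
  • Original format: PDF
  • Language: English (eng)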
