
Probabilistic modelling of symbolic data and blocking collapsed Gibbs samplers for topic models



Abstract

Symbolic data are distributions constructed from data points. When a big dataset can be organised into groups, one may first summarise each group by a symbol and then analyse the symbolic dataset directly. By reducing the dataset to a more manageable size, this enables exploratory analysis and statistical inference that would be infeasible for the original large dataset. In the first half of this thesis, we develop a probabilistic approach for constructing likelihood functions for two types of symbolic data: interval-valued data and histogram-valued data. Existing methods ignore the process by which the symbolic data are constructed, namely the aggregation of real-valued data generated from some underlying process. We develop the foundations of likelihood-based statistical inference for random symbols that directly incorporates the underlying generative procedure into the analysis. This permits models for the underlying real-valued data to be fitted directly, given only the symbolic summaries. Our approach overcomes several problems associated with existing methods and can jointly model intra- and inter-symbol variation. The new methods are illustrated with simulated and real data analyses.

Latent variable models are powerful tools for extracting unobserved features from big datasets. Well-known examples are latent Dirichlet allocation (LDA) and hierarchical Dirichlet process mixtures (HDP-M) for topic modelling. Collapsed Gibbs samplers are routinely used for Bayesian inference in these models because of their superior chain-mixing performance. In the second half of the thesis, we propose a blocking scheme for collapsed Gibbs samplers for the LDA and HDP-M models that improves chain-mixing efficiency. For the LDA model, we develop an O(log K)-step nested sampling scheme (where K is the number of topics) to simulate the latent variables of each block directly. To obtain such a blocking scheme for the HDP-M model, we introduce residual allocation processes (RAPs), which construct the random partitions induced by Dirichlet processes in a class-wise manner, and we propose hierarchical RAPs for constructing the random partitions induced by the HDP. Derived from these residual allocation constructions, the blocking scheme consists of nested sampling of the latent variables for existing topics and residual allocation sampling of the latent variables for new topics. We demonstrate that the blocking scheme achieves substantial improvements in chain mixing and a significant reduction in computation time.
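To make the construction-aware likelihood idea concrete, consider one standard instance (an assumed setup for illustration, not taken from the thesis text): an interval symbol recording the minimum $a$ and maximum $b$ of $n$ i.i.d. draws from an underlying model with density $g_\theta$ and distribution function $G_\theta$. The joint density of the sample extremes then yields the symbolic likelihood

$$L(\theta;\, a, b) \;=\; n(n-1)\, g_\theta(a)\, g_\theta(b)\, \big[G_\theta(b) - G_\theta(a)\big]^{\,n-2}, \qquad a < b.$$

Maximising this over $\theta$ fits the underlying real-valued model given only the interval summary, which is the kind of inference that incorporating the generative (aggregation) procedure makes possible.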
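For context on the second part, the following is a minimal sketch of the standard token-by-token collapsed Gibbs sampler for LDA that blocking schemes aim to improve; the array names and layout are assumptions for illustration. Where this baseline normalises over all K topics for every token, the thesis's blocked sampler instead resamples groups of latent variables jointly, drawing each topic in O(log K) steps via nested sampling.

```python
import numpy as np

def collapsed_gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta, rng):
    """One sweep of the standard collapsed Gibbs sampler for LDA.

    docs[d]  -- list of word ids in document d
    z[d][i]  -- current topic of token i in document d
    ndk[d,k] -- tokens in document d assigned to topic k
    nkw[k,w] -- tokens of word w assigned to topic k
    nk[k]    -- total tokens assigned to topic k
    """
    K, V = nkw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1    # hold token out
            # full conditional: p(z=k | rest) ∝ (ndk+alpha)(nkw+beta)/(nk+V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())               # O(K) draw per token
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1    # restore counts
```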
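Finally, a minimal sketch of the classical stick-breaking construction of Dirichlet process weights, which is the textbook example of a residual allocation scheme: each atom takes a Beta-distributed fraction of the stick left over by its predecessors. The class-wise RAPs and hierarchical RAPs developed in the thesis generalise this idea and are not reproduced here; the truncation level and parameter values below are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(rng, alpha, trunc):
    """Truncated stick-breaking (residual allocation) weights for a DP:
    v_k ~ Beta(1, alpha),  w_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, alpha, size=trunc)
    return v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

rng = np.random.default_rng(0)
w = stick_breaking_weights(rng, alpha=1.0, trunc=100)
# A random partition of 20 items induced by the DP: cluster labels drawn
# from the (renormalised) truncated weights.
labels = rng.choice(len(w), size=20, p=w / w.sum())
```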
