
Latent IBP Compound Dirichlet Allocation


Abstract

We introduce the four-parameter IBP compound Dirichlet process (ICDP), a stochastic process that generates sparse non-negative vectors with a potentially unbounded number of entries. If we repeatedly sample from the ICDP, we can generate sparse matrices with an infinite number of columns and power-law characteristics. We apply the four-parameter ICDP to sparse nonparametric topic modelling to account for the very large number of topics present in large text corpora and the power-law distribution of the vocabulary of natural languages. The model, which we call latent IBP compound Dirichlet allocation (LIDA), allows for power-law distributions both in the number of topics summarising the documents and in the number of words defining each topic. It can be interpreted as a sparse variant of the hierarchical Pitman-Yor process when applied to topic modelling. We derive an efficient and simple collapsed Gibbs sampler closely related to the collapsed Gibbs sampler of latent Dirichlet allocation (LDA), making the model applicable in a wide range of domains. Our nonparametric Bayesian topic model compares favourably to the widely used hierarchical Dirichlet process and its heavy-tailed version, the hierarchical Pitman-Yor process, on benchmark corpora. Experiments demonstrate that accounting for the power-law distribution of real data is beneficial and that sparsity provides more interpretable results.
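
To make the idea of "sparse non-negative vectors" concrete, the sketch below draws a few such vectors in Python. It is only an illustration under simplifying assumptions, not the construction used in the paper. It uses a finite beta-Bernoulli approximation to a one-parameter IBP, so it has no power-law behaviour. The function name `sample_sparse_rows`, the parameters `alpha` and `gamma`, and the rule that forces at least one active entry per row are all hypothetical choices made for this example.

```python
import numpy as np


def sample_sparse_rows(n_rows, n_cols, alpha, gamma, rng):
    """Draw sparse probability vectors: a binary mask picks the active
    entries, and a symmetric Dirichlet spreads mass over them.

    ASSUMPTION: finite beta-Bernoulli approximation to a one-parameter IBP,
    not the four-parameter ICDP described in the abstract.
    """
    # Per-column inclusion probabilities, shared across all rows.
    pi = rng.beta(alpha / n_cols, 1.0, size=n_cols)
    mask = rng.random((n_rows, n_cols)) < pi

    rows = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        active = np.flatnonzero(mask[i])
        if active.size == 0:
            # ASSUMPTION: force one active entry so the row can be normalised.
            active = np.array([rng.integers(n_cols)])
        if active.size == 1:
            rows[i, active] = 1.0
        else:
            rows[i, active] = rng.dirichlet(np.full(active.size, gamma))
    return rows


rng = np.random.default_rng(0)
theta = sample_sparse_rows(n_rows=5, n_cols=50, alpha=3.0, gamma=1.0, rng=rng)
print((theta > 0).sum(axis=1))  # number of active entries in each row
```

Each row of `theta` sums to one and is zero outside the entries selected by the mask, which is the sense in which the vectors are sparse. Stacking rows gives a sparse matrix whose columns are shared across rows through the common inclusion probabilities.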
