Conference: European Conference on Machine Learning and Knowledge Discovery in Databases

Nested Hierarchical Dirichlet Process for Nonparametric Entity-Topic Analysis

Abstract

The Hierarchical Dirichlet Process (HDP) is a Bayesian nonparametric prior for grouped data, such as collections of documents, in which each group is a mixture over a shared set of mixture densities, or topics, and the number of topics is not fixed but grows with the size of the data. The Nested Dirichlet Process (NDP) builds on the HDP to cluster the documents, while allowing them to choose only from a set of specific topic mixtures. In many applications, such a set of topic mixtures may be identified with the set of entities for the collection. Often, however, multiple entities are associated with each document, and the set of entities may not be completely known in advance. In this paper, we address this problem using a nested HDP (nHDP), where the base distribution of an outer HDP is itself an HDP. The inner HDP creates a countably infinite set of topic mixtures and associates them with entities, while the outer HDP associates documents with these entities or topic mixtures. Making use of a nested Chinese Restaurant Franchise (nCRF) representation of the nested HDP, we propose a collapsed Gibbs sampling-based inference algorithm for the model. Because of the coupling between the two HDP levels, scaling up is naturally a challenge for the inference algorithm. The algorithm we propose extends the direct sampling scheme of the HDP to two levels. In our experiments on two real-world research corpora, we show that, even when large fractions of author entities are hidden, the nHDP generalizes significantly better than existing models. More importantly, we are able to detect missing authors at a reasonable level of accuracy.
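The two-level structure described in the abstract can be illustrated with a small generative sketch. This is not the paper's model or its inference algorithm: it is a simplified, finitely truncated approximation in which each word first picks an entity from its document's entity mixture and then picks a topic from that entity's topic mixture. The function names, the truncation levels `n_topics` and `n_entities`, the concentration parameters `gamma` and `alpha`, and the small `floor` constant are all assumptions introduced for illustration.

```python
import numpy as np


def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process."""
    v = rng.beta(1.0, concentration, size=truncation)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


def sample_nested_hdp(n_docs, words_per_doc, n_topics=20, n_entities=10,
                      gamma=1.0, alpha=5.0, seed=0):
    """Draw entity and topic assignments from a truncated two-level sketch."""
    rng = np.random.default_rng(seed)
    floor = 1e-6  # assumption: keeps every Dirichlet parameter strictly positive

    # Inner level: shared topic weights, then one topic mixture per entity.
    topic_weights = stick_breaking(gamma, n_topics, rng)
    entity_topic = rng.dirichlet(alpha * topic_weights + floor, size=n_entities)

    # Outer level: shared entity weights, then one entity mixture per document.
    entity_weights = stick_breaking(gamma, n_entities, rng)
    doc_entity = rng.dirichlet(alpha * entity_weights + floor, size=n_docs)

    entities = np.empty((n_docs, words_per_doc), dtype=int)
    topics = np.empty((n_docs, words_per_doc), dtype=int)
    for d in range(n_docs):
        for i in range(words_per_doc):
            e = rng.choice(n_entities, p=doc_entity[d])  # entity for this word
            entities[d, i] = e
            topics[d, i] = rng.choice(n_topics, p=entity_topic[e])  # topic given entity
    return entities, topics, doc_entity, entity_topic


if __name__ == "__main__":
    ents, tops, _, _ = sample_nested_hdp(n_docs=3, words_per_doc=8)
    print(ents)
    print(tops)
```

The truncation replaces the countably infinite sets of topics and entities with fixed upper bounds, which keeps the sketch short. A sampler based on the nested Chinese Restaurant Franchise, as the abstract describes, would instead let both sets grow with the data.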
