ACM International Conference on Information and Knowledge Management

Modeling Topic Hierarchies with the Recursive Chinese Restaurant Process

Abstract

Topic models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet processes (HDP) are simple solutions for discovering topics in a set of unannotated documents. Although they are simple and popular, a major shortcoming of LDA and HDP is that they do not organize the topics into the hierarchical structure that is naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows documents to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles, and show the discovered topic hierarchies. We compare the predictive power of rCRP with that of LDA, HDP, and the nested Chinese restaurant process (nCRP) using held-out likelihood and show that rCRP outperforms the others. We also propose two metrics that quantify the characteristics of a topic hierarchy and use them to compare the hierarchies discovered by rCRP and nCRP. The results show that rCRP discovers a hierarchy in which topics become more specialized toward the leaves, and in which topics in the immediate family exhibit more affinity than topics beyond the immediate family.
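
As background for the priors named in the abstract, the following is a minimal sketch of the standard, non-recursive Chinese restaurant process on which both nCRP and rCRP build. It is not the rCRP introduced in the paper, whose recursive construction the abstract does not spell out. The function name `sample_crp`, the concentration parameter `alpha`, and the `seed` argument are illustrative assumptions, not names taken from the paper.

```python
import random
from typing import List, Optional


def sample_crp(num_customers: int, alpha: float,
               seed: Optional[int] = None) -> List[int]:
    """Seat customers one at a time under a standard Chinese restaurant process.

    Hypothetical helper for illustration only. Returns the table index
    chosen by each customer, in arrival order.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    table_counts: List[int] = []  # table_counts[k] = customers seated at table k
    assignments: List[int] = []
    for _ in range(num_customers):
        # An existing table k is chosen with weight table_counts[k];
        # a new table is opened with weight alpha.
        weights = table_counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if table == len(table_counts):
            table_counts.append(1)  # open a new table
        else:
            table_counts[table] += 1
        assignments.append(table)
    return assignments
```

In a topic-model reading, each table plays the role of a topic and each customer a unit being assigned to one. Because the number of occupied tables is not fixed in advance, a prior of this kind can leave the number of topics unbounded, which is the property the abstract refers to as nonparametric.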