Differential Topic Models

Abstract

In applications we may want to compare different document collections: they may share content but also have different and unique aspects in particular collections. This task has been called comparative text mining or cross-collection modeling. We present a differential topic model for this application that models both topic differences and similarities. For this we use hierarchical Bayesian nonparametric models. Moreover, we found it was important to properly model power-law phenomena in topic-word distributions, and thus we used the full Pitman-Yor process rather than just a Dirichlet process. Furthermore, we propose the transformed Pitman-Yor process (TPYP) to incorporate prior knowledge, such as vocabulary variations in different collections, into the model. To deal with the non-conjugacy between the model prior and the likelihood in the TPYP, we propose an efficient sampling algorithm using a data augmentation technique based on the multinomial theorem. Experimental results show that the model discovers interesting aspects of different collections. We also show that the proposed MCMC-based algorithm achieves a dramatically reduced test perplexity compared to some existing topic models. Finally, we show that our model outperforms the state of the art for document classification/ideology prediction on a number of text collections.
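
To illustrate why the abstract prefers the full Pitman-Yor process over a Dirichlet process for power-law behaviour, here is a minimal sketch of the standard Chinese restaurant process representation. It is not the paper's sampler and does not implement the TPYP; the function name `pitman_yor_crp` and its parameters `discount` and `concentration` are illustrative assumptions. Setting `discount` to 0 recovers the Dirichlet process, while a positive discount makes the number of occupied tables grow much faster with the number of customers.

```python
import random


def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers via the Chinese restaurant process of a Pitman-Yor process.

    Hypothetical helper for illustration only (not from the paper).
    discount = 0 gives the Dirichlet process special case.
    Returns the list of table sizes.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= 0.0:
        # Simplifying assumption of this sketch: strictly positive concentration.
        raise ValueError("concentration must be positive in this sketch")

    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # Probability that customer n + 1 opens a new table.
        p_new = (concentration + discount * len(tables)) / (n + concentration)
        if rng.random() < p_new:
            tables.append(1)
        else:
            # Otherwise join table k with probability proportional to
            # (tables[k] - discount).
            weights = [count - discount for count in tables]
            k = rng.choices(range(len(tables)), weights=weights)[0]
            tables[k] += 1
    return tables


if __name__ == "__main__":
    dp_tables = pitman_yor_crp(2000, discount=0.0, concentration=1.0, seed=1)
    py_tables = pitman_yor_crp(2000, discount=0.5, concentration=1.0, seed=1)
    print("Dirichlet process tables:", len(dp_tables))
    print("Pitman-Yor process tables:", len(py_tables))
```

Reading tables as distinct word types and customers as word tokens, the positive discount produces many small tables alongside a few large ones, which is the heavy-tailed pattern the abstract refers to in topic-word distributions.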