IEEE Transactions on Knowledge and Data Engineering

Parsimonious Topic Models with Salient Word Discovery


Abstract

We propose a parsimonious topic model for text corpora. In related models such as Latent Dirichlet Allocation (LDA), all words are modeled topic-specifically, even though many words occur with similar frequencies across different topics. Our modeling determines salient words for each topic, which have topic-specific probabilities, with the rest explained by a universal shared model. Further, in LDA all topics are in principle present in every document. By contrast, our model gives sparse topic representation, determining the (small) subset of relevant topics for each document. We derive a Bayesian Information Criterion (BIC), balancing model complexity and goodness of fit. Here, interestingly, we identify an effective sample size and corresponding penalty specific to each parameter type in our model. We minimize BIC to jointly determine our entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. Results on three text corpora and an image dataset show that our model achieves higher test set likelihood and better agreement with ground-truth class labels, compared to LDA and to a model designed to incorporate sparsity.
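The model-selection step described above, minimizing BIC jointly over the salient-word sets, per-document topic subsets, and the total number of topics, rests on the standard BIC trade-off between goodness of fit and parameter count. A minimal sketch of that trade-off for choosing the topic count, using the generic penalty `k log n` rather than the paper's per-parameter-type effective sample sizes (a deliberate simplification), with purely illustrative likelihood values:

```python
import math

def bic(log_likelihood, num_params, sample_size):
    # Generic BIC: -2 * logL + k * log(n). Lower is better.
    # (The paper refines the penalty with an effective sample size
    # per parameter type; a single count n is used here for illustration.)
    return -2.0 * log_likelihood + num_params * math.log(sample_size)

# Hypothetical candidate models: topic count -> (log-likelihood, parameter count).
# More topics fit better (higher logL) but pay a larger complexity penalty.
candidates = {
    5:  (-14000.0, 400),
    10: (-11500.0, 800),
    20: (-11400.0, 1600),
}
n_tokens = 50000  # effective sample size, here taken as the total token count

# Pick the topic count minimizing BIC.
best = min(candidates, key=lambda k: bic(*candidates[k], n_tokens))
print(best)  # the 10-topic model wins: 20 topics barely improves fit
```

Here the jump from 10 to 20 topics improves the log-likelihood only slightly, so the doubled penalty term dominates and BIC selects 10 topics; the paper applies the same principle jointly across all parameter types rather than to the topic count alone.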

