International Conference on Multimedia Big Data

Contextual-LDA: A Context Coherent Latent Topic Model for Mining Large Corpora

Abstract

Statistical topic models, represented by Latent Dirichlet Allocation (LDA) and its variants, are ubiquitously applied to understanding large corpora. However, topic models based on the bag-of-words (BoW) assumption rarely incorporate contextual information, which carries a substantial amount of useful knowledge within a document, into the probabilistic framework. This shortcoming prevents LDA from learning the contextual information contained in sentences and paragraphs. We present a context-coherent topic model for text learning, namely Contextual Latent Dirichlet Allocation (Contextual-LDA), which incorporates contextual knowledge without much increase in perplexity. In our model, a document is segmented into finely divided word sequences, each corresponding to one distinct latent topic that captures local context, while global context is obtained from the position at which a segment appears in the document. We learn the parameters using Gibbs sampling, analogous to traditional LDA. Our model extends LDA to exploit the statistical strength of BoW without ignoring the knowledge contained in the original context of documents. We also demonstrate the model in a supervised scenario. Experimental results on the BBC corpus in both unsupervised and supervised settings show that, compared with the LDA model, our method is well suited to text mining.
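The paper does not include an implementation; the sketch below is a minimal, illustrative collapsed Gibbs sampler in which the sampling unit is a word segment rather than a single word, so that every word in a segment shares one latent topic, mirroring the segment-level inference the abstract describes. The function name gibbs_segment_lda, the segmented-input format, and the hyperparameters alpha and beta are assumptions for illustration only; the position-based global-context component of Contextual-LDA is omitted.

```python
import numpy as np

def gibbs_segment_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler where the sampling unit is a segment,
    so all words in a segment share one latent topic.

    docs: list of documents; each document is a list of segments;
          each segment is a list of word ids in [0, V).
    Returns per-document topic proportions theta and
    per-topic word distributions phi.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # segment-topic counts per document
    n_kw = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total tokens per topic
    z = []                           # current topic of every segment

    # random initialisation of segment topics
    for d, doc in enumerate(docs):
        z.append([])
        for seg in doc:
            k = int(rng.integers(K))
            z[d].append(k)
            n_dk[d, k] += 1
            for w in seg:
                n_kw[k, w] += 1
            n_k[k] += len(seg)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for s, seg in enumerate(doc):
                # remove the segment's counts before resampling its topic
                k = z[d][s]
                n_dk[d, k] -= 1
                for w in seg:
                    n_kw[k, w] -= 1
                n_k[k] -= len(seg)

                # p(z_s = k) is proportional to (n_dk + alpha) times, for each
                # word i: (n_kw + beta + repeats seen so far) / (n_k + V*beta + i)
                logp = np.log(n_dk[d] + alpha)
                for k2 in range(K):
                    seen = {}
                    for i, w in enumerate(seg):
                        logp[k2] += np.log(n_kw[k2, w] + beta + seen.get(w, 0))
                        logp[k2] -= np.log(n_k[k2] + V * beta + i)
                        seen[w] = seen.get(w, 0) + 1

                p = np.exp(logp - logp.max())
                k = int(rng.choice(K, p=p / p.sum()))

                # add the counts back under the newly sampled topic
                z[d][s] = k
                n_dk[d, k] += 1
                for w in seg:
                    n_kw[k, w] += 1
                n_k[k] += len(seg)

    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_kw.sum(1, keepdims=True) + V * beta)
    return theta, phi

# toy usage: two documents, each a list of segments of word ids
docs = [[[0, 1, 1], [2, 3]], [[3, 4], [0, 2, 4]]]
theta, phi = gibbs_segment_lda(docs, V=5, K=2, iters=50)
```

Because a whole segment changes topic at once, neighbouring words are forced to be topically coherent, which is the local-context effect the abstract attributes to Contextual-LDA.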
