Journal: Data Mining and Knowledge Discovery

Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm

Abstract

The Internet holds a large amount of user-labeled text, such as web pages with categories, papers with keywords, and tweets with hashtags. In recent years, supervised topic models such as Labeled Latent Dirichlet Allocation have been widely used to discover abstract topics in labeled text corpora. However, these models rely on the bag-of-words assumption and therefore ignore word order, which discards much semantic information. To model semantic label information and word order jointly, we propose a novel topic model, Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and partly considers word order. To estimate the parameters of the proposed LPLDA model, we develop a batch inference algorithm based on Gibbs sampling. Moreover, to speed up LPLDA on large-scale streaming data, we further propose an online inference algorithm for LPLDA. Extensive experiments compare LPLDA with four state-of-the-art baselines. The results show that (1) batch LPLDA significantly outperforms the baselines in terms of case study, perplexity, scalability, and a third-party task in most cases; (2) the online algorithm for LPLDA is clearly more efficient than the batch method while still producing good results.
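The abstract does not give the LPLDA sampling equations, so the sketch below is not the authors' model. It is a minimal collapsed Gibbs sampler for plain Labeled LDA, run over documents that are assumed to be already segmented into phrases, to illustrate the two ideas the abstract combines: each token's topic is restricted to its document's labels, and a "token" may be a multi-word phrase. The function name, hyperparameter values, and toy corpus are all hypothetical.

```python
import random
from collections import defaultdict


def labeled_lda_gibbs(docs, doc_labels, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain Labeled LDA (not LPLDA itself).

    docs:       list of token lists; a token may be a word or a phrase string.
    doc_labels: list of label lists; topics of document d are limited to these.
    Returns (topic_token_counts, topic_totals).
    """
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})

    n_dk = [defaultdict(int) for _ in docs]        # document-topic counts
    n_kw = defaultdict(lambda: defaultdict(int))   # topic-token counts
    n_k = defaultdict(int)                         # tokens per topic
    z = []                                         # current topic of every token

    # Random initialisation, restricted to each document's own labels.
    for d, doc in enumerate(docs):
        z_d = []
        for tok in doc:
            k = rng.choice(doc_labels[d])
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][tok] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            labels = doc_labels[d]
            for i, tok in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][tok] -= 1
                n_k[k] -= 1
                # Conditional over the document's labels only.
                weights = [
                    (n_dk[d][c] + alpha)
                    * (n_kw[c][tok] + beta) / (n_k[c] + vocab_size * beta)
                    for c in labels
                ]
                k = rng.choices(labels, weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][tok] += 1
                n_k[k] += 1
    return n_kw, n_k


# Toy corpus of pre-segmented phrases (hypothetical data).
docs = [
    ["machine learning", "topic model"],
    ["football match", "goal"],
    ["topic model", "goal"],
]
doc_labels = [["ml"], ["sports"], ["ml", "sports"]]
topic_token_counts, topic_totals = labeled_lda_gibbs(docs, doc_labels)
```

Because a single-label document can only assign its tokens to that label, the first two documents act as anchors, and only the third document's assignments are actually sampled.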
