
FastBTM: Reducing the sampling time for biterm topic model



Abstract

Due to the popularity of social networks such as microblogs and Twitter, a vast amount of short text data is created every day. Research on short text, such as topic inference, has therefore become increasingly significant. The biterm topic model (BTM) benefits from the word co-occurrence patterns of the corpus, which makes it perform better than conventional topic models at uncovering latent semantic relevance in short text. However, BTM resorts to Gibbs sampling to infer topics, which is very time consuming, especially for large-scale datasets or when the number of topics is extremely large: it requires O(K) operations per sample, where K denotes the number of topics in the corpus. In this paper, we propose FastBTM, an acceleration algorithm that uses an efficient sampling method for BTM and converges much faster than BTM without degrading topic quality. FastBTM is based on Metropolis-Hastings and the alias method, both of which have been widely adopted in the latent Dirichlet allocation (LDA) model and have achieved outstanding speedup. FastBTM effectively reduces the sampling complexity of the biterm topic model from O(K) to O(1) amortized time. We carry out a number of experiments on three datasets: two short text datasets, the Tweets2011 Collection and the Yahoo! Answers dataset, and one long document dataset, the Enron dataset. Our experimental results show that as the number of topics K increases, the gap in running time between FastBTM and BTM grows especially large. In addition, FastBTM is effective for both short text datasets and long document datasets. (C) 2017 Elsevier B.V. All rights reserved.
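The O(1) amortized sampling claimed in the abstract rests on the alias method: after an O(K) preprocessing step, each draw from a K-outcome discrete distribution costs constant time. Below is a minimal illustrative sketch of Vose's variant of the alias method in Python; it is not the authors' implementation, and the function names are our own.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(K) preprocessing so that every later
    draw from the discrete distribution `probs` takes O(1) time."""
    k = len(probs)
    scaled = [p * k for p in probs]          # rescale so the mean mass is 1
    prob = [0.0] * k                         # acceptance probability per column
    alias = [0] * k                          # fallback outcome per column
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                  # column s keeps its own mass ...
        alias[s] = l                         # ... and borrows the rest from l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in small + large:           # numerical leftovers: full columns
        prob[leftover] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """One O(1) draw: pick a uniform column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

In FastBTM-style samplers the table is built from a stale copy of the topic distribution and reused for many draws, with Metropolis-Hastings accept/reject steps correcting for the staleness; rebuilding the table only occasionally is what makes the cost O(1) amortized rather than O(1) worst case.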

Bibliographic details

  • Source
    《Knowledge-Based Systems》 | 2017, Issue 15 | pp. 11-20 | 10 pages
  • Author affiliations

    Tsinghua Univ, Dept Comp Sci & Technol, State Key Lab Intelligent Technol & Syst, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China;



    Beijing Univ Posts & Telecommun, Beijing 100876, Peoples R China;

    Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China;

  • Indexing
  • Original format: PDF
  • Language: eng
  • CLC classification
  • Keywords

    BTM; Topic model; Alias method; Metropolis-Hastings; Acceleration algorithm;


