International Joint Conference on Neural Networks

Topic Discovery for Streaming Short Texts with CTM

Abstract

Short texts are prevalent on today's Web, especially since the emergence of social media, and discovering the topics of streaming short texts has become an important task for many content analysis applications. Conventional topic models such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) suffer from a sparsity problem when used to infer latent topics from short texts, because they derive topics from document-level word co-occurrence by modeling each document as a mixture of topics. In contrast, the Biterm Topic Model (BTM) discovers topics in short texts by directly modeling the generation of word co-occurrence patterns across the whole corpus. However, semantic information is still lacking for short texts. In this paper, to alleviate the sparsity problem, preserve the semantic information of documents, and obtain the latent topic information of streaming short texts immediately, we propose a joint topic model for Chinese streaming short texts (CTM) based on the online algorithms of LDA and BTM. Experiments on short texts from Sina Weibo show that our joint topic model discovers more precise topics and supports more applications. In addition, because preprocessing Chinese text differs from preprocessing English text and key phrase extraction is error-prone, we use a combined word method to extend the length of short texts and reduce errors in extracting key phrases.
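
To make the contrast between document-level and corpus-level co-occurrence concrete, the following is a minimal sketch of the biterm idea that BTM builds on: every unordered word pair in a short text is a biterm, and the pairs are pooled over the whole corpus. This is an illustration only, not the authors' CTM and not a full BTM implementation (there is no topic inference here). The function names `extract_biterms` and `corpus_biterm_counts` and the toy documents are hypothetical, and the input is assumed to be already segmented into words, which for Chinese text would require a word segmentation step first.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    Each pair is sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Pool biterms over the whole corpus and count them.

    Assumption: `docs` is a list of already-tokenized short texts.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical toy corpus of three pre-segmented short texts.
docs = [
    ["apple", "phone", "release"],
    ["apple", "phone", "price"],
    ["football", "match"],
]
counts = corpus_biterm_counts(docs)
```

Pooling the pairs over the corpus is what lets a biterm-based model draw on co-occurrence evidence that no single short text provides on its own, at the cost of no longer tracking which document each pair came from.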
