Conference: International Conference on World Wide Web

The Dual-Sparse Topic Model: Mining Focused Topics and Focused Terms in Short Text



Abstract

Topic modeling has proven to be an effective method for exploratory text mining. Most topic models assume that a document is generated from a mixture of topics. In real-world scenarios, individual documents usually concentrate on a few salient topics instead of covering a wide variety of topics. Likewise, a real topic uses a narrow range of terms rather than covering the whole vocabulary. Understanding this sparsity of information is especially important for analyzing user-generated Web content and social media, which are characterized by extremely short posts and condensed discussions. In this paper, we propose a dual-sparse topic model that addresses the sparsity in both the topic mixtures and the word usage. By applying a "Spike and Slab" prior to decouple the sparsity and smoothness of the document-topic and topic-word distributions, we allow individual documents to select a few focused topics and individual topics to select a few focused terms. Experiments on different genres of large corpora demonstrate that the dual-sparse topic model outperforms both classical topic models and existing sparsity-enhanced topic models. This improvement is especially notable on collections of short documents.
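The "Spike and Slab" idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's exact generative process or inference procedure: the function name `sample_sparse_distribution` and the parameters `select_prob`, `slab`, and `smooth` are hypothetical, chosen only for illustration. Binary selectors (the "spike") decide which components are active, and a Dirichlet draw (the "slab") spreads probability mass mostly over the selected components, with a small smoothing term keeping every component strictly positive.

```python
import numpy as np

def sample_sparse_distribution(rng, size, select_prob, slab=1.0, smooth=0.01):
    """Draw one sparse probability vector of length `size` (illustrative only).

    Assumptions (not taken from the paper): `select_prob` is the Bernoulli
    probability of switching a component on, `slab` is the Dirichlet
    concentration of selected components, and `smooth` is a weak
    concentration shared by all components.
    """
    # Spike: independent Bernoulli selectors pick the focused components.
    selectors = rng.random(size) < select_prob
    # Guard against the empty selection so at least one component is active.
    if not selectors.any():
        selectors[rng.integers(size)] = True
    # Slab: selected components get concentration slab + smooth,
    # unselected ones only the weak smoothing term.
    concentration = slab * selectors + smooth
    return selectors, rng.dirichlet(concentration)

rng = np.random.default_rng(0)
# The same routine can stand in for a document's topic mixture ...
topic_selectors, theta = sample_sparse_distribution(rng, size=20, select_prob=0.15)
# ... or for one topic's word distribution over a toy vocabulary.
word_selectors, phi = sample_sparse_distribution(rng, size=500, select_prob=0.05)
print("selected topics:", int(topic_selectors.sum()), "of", theta.size)
print("selected terms:", int(word_selectors.sum()), "of", phi.size)
```

Because the selectors control sparsity and the concentration values control smoothness, the two properties can be tuned separately, which is the decoupling the abstract describes.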
