IEEE Transactions on Knowledge and Data Engineering

Using Hashtag Graph-Based Topic Model to Connect Semantically-Related Words Without Co-Occurrence in Microblogs



Abstract

In this paper, we introduce a new topic model that uses hashtag graphs to make sense of the chaotic microblogging environment. Inferring topics on Twitter is a vital but challenging task in many important applications. The shortness and informality of tweets lead to extremely sparse vector representations over a large vocabulary. As a result, conventional topic models (e.g., latent Dirichlet allocation [1] and latent semantic analysis [2]) fail to learn high-quality topic structures. Tweets commonly appear with rich user-generated hashtags. These hashtags make tweets semi-structured and semantically related to each other. Since hashtags are used as keywords in tweets to mark messages or to form conversations, they provide an additional path for connecting semantically related words. Treating tweets as semi-structured texts, we propose a novel topic model, the Hashtag Graph-based Topic Model (HGTM), to discover the topics of tweets. By utilizing the hashtag relation information in hashtag graphs, HGTM is able to discover semantic relations between words even if the words do not co-occur within a specific tweet. In this way, HGTM successfully alleviates the sparsity problem. Our investigation illustrates that user-contributed hashtags can serve as weakly supervised information for topic modeling, and that the relations between hashtags can reveal latent semantic relations between words. We evaluate the effectiveness of HGTM on tweet (hashtag) clustering and hashtag classification problems. Experiments on two real-world tweet data sets show that HGTM has a strong capability to handle the sparseness and noise problems in tweets. Furthermore, HGTM discovers more distinct and coherent topics than the state-of-the-art baselines.
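
The idea that hashtags give words an extra path to each other can be illustrated with a small sketch. This is not the HGTM model described in the abstract, only a toy illustration of a hashtag co-occurrence graph. The tweet data and the function names (`build_hashtag_graph`, `hashtags_of`, `linked_without_cooccurrence`) are hypothetical, and the rule "two hashtags are related if they appear in the same tweet" is an assumption made here for simplicity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus: each tweet is (words, hashtags).
TWEETS = [
    (["goal", "striker"], ["#worldcup", "#football"]),
    (["penalty", "referee"], ["#football"]),
    (["stadium", "tickets"], ["#worldcup"]),
    (["recipe", "pasta"], ["#cooking"]),
]


def build_hashtag_graph(tweets):
    """Count, for each hashtag pair, the tweets in which both hashtags appear."""
    graph = defaultdict(int)
    for _, hashtags in tweets:
        # Sorting makes every edge key a canonical (smaller, larger) pair.
        for a, b in combinations(sorted(set(hashtags)), 2):
            graph[(a, b)] += 1
    return graph


def hashtags_of(word, tweets):
    """Return the set of hashtags attached to any tweet containing the word."""
    return {tag for words, tags in tweets if word in words for tag in tags}


def linked_without_cooccurrence(word_a, word_b, tweets):
    """True if the words never share a tweet but share or neighbor a hashtag."""
    if any(word_a in words and word_b in words for words, _ in tweets):
        return False  # direct co-occurrence, so no hashtag path is needed
    graph = build_hashtag_graph(tweets)
    tags_a = hashtags_of(word_a, tweets)
    tags_b = hashtags_of(word_b, tweets)
    if tags_a & tags_b:
        return True  # both words were used under the same hashtag
    # Otherwise look for an edge between any hashtag of A and any hashtag of B.
    return any(
        tuple(sorted((a, b))) in graph for a in tags_a for b in tags_b
    )


if __name__ == "__main__":
    print(linked_without_cooccurrence("penalty", "stadium", TWEETS))
    print(linked_without_cooccurrence("penalty", "pasta", TWEETS))
```

In this toy corpus, "penalty" and "stadium" never appear in the same tweet, yet they become connected because #football and #worldcup co-occur in the first tweet. A full topic model would use such relations as weak supervision during inference rather than as a hard yes/no link.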
