Topic modeling approaches such as Latent Dirichlet Allocation (LDA) and Hierarchical LDA (hLDA) have been used extensively to discover topics in various corpora. Unfortunately, these approaches do not perform well when applied to collections of social media posts. Further, they do not allow users to focus topic discovery around subjectively interesting concepts. We propose the new Semi-Supervised Microblog-hLDA (SS-Micro-hLDA) model to discover topic hierarchies in short, noisy microblog documents in a way that allows users to focus topic discovery around areas of interest. We test SS-Micro-hLDA on a large, public collection of Twitter messages and posts from the social blogging site Reddit, and show that our model outperforms hLDA, Constrained-hLDA, Recursive-rCRP, and TSSB in terms of Pointwise Mutual Information (PMI) score. Further, we evaluate our model in terms of the information entropy of held-out data and show that the new approach produces highly focused topic hierarchies.