Science Advances

A network approach to topic models


Abstract

One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
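The representation described in the abstract, a text corpus as a bipartite network of documents and words, can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_bipartite_edges` and `node_degrees` are hypothetical, tokenization is plain lowercasing and whitespace splitting, and the toy corpus is invented. Each word occurrence becomes one edge between a document node and a word node, so repeated words give multi-edges. The SBM inference step itself is not shown.

```python
from collections import Counter


def build_bipartite_edges(docs):
    """Map (doc_index, word) -> number of times the word occurs in that document.

    Each count is the multiplicity of the edge joining a document node
    to a word node in the bipartite multigraph.
    """
    edges = Counter()
    for doc_index, text in enumerate(docs):
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for word in text.lower().split():
            edges[(doc_index, word)] += 1
    return edges


def node_degrees(edges):
    """Return the degree of every document node and every word node."""
    doc_degree = Counter()
    word_degree = Counter()
    for (doc_index, word), count in edges.items():
        doc_degree[doc_index] += count   # document length in tokens
        word_degree[word] += count       # corpus-wide frequency of the word
    return doc_degree, word_degree


# Invented toy corpus, for illustration only.
toy_docs = [
    "networks have communities and communities have nodes",
    "topics are groups of words",
    "communities of words are topics",
]

edges = build_bipartite_edges(toy_docs)
doc_degree, word_degree = node_degrees(edges)

for (doc_index, word), count in sorted(edges.items()):
    print(f"doc {doc_index} -- {word} (x{count})")
```

In this picture a document node's degree is the document's length and a word node's degree is the word's corpus frequency. A community-detection method applied to such a graph would group document nodes and word nodes separately, which is the sense in which word groups play the role of topics.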
