Conference of Open Innovations Association

Fast and modular regularized topic modelling

Abstract

Topic modelling is an area of text mining that has been actively developed over the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents. It defines each topic by a probability distribution over words and describes each document with a probability distribution over topics. Applications often impose many requirements, such as problem-specific knowledge and additional data, that must be taken into account. It is therefore natural to treat topic modelling as a multiobjective optimization problem. Historically, however, Bayesian learning became the most popular approach to topic modelling. In the Bayesian paradigm, all requirements are formalized in terms of a probabilistic generative process. This approach is not always convenient due to some limitations and technical difficulties. In this work, we develop a non-Bayesian multiobjective approach called the Additive Regularization of Topic Models (ARTM). It is based on regularized Maximum Likelihood Estimation (MLE), and we show that many well-known Bayesian topic models can be reformulated in a much simpler way from the regularization point of view. We review some of the most important types of topic models: multimodal, multilingual, temporal, hierarchical, graph-based, and short-text. The ARTM framework makes it easy to combine different types of models into new models with the properties an application requires. This modular “lego-style” technology for topic modelling is implemented in the open-source library BigARTM.
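To make the idea of regularized MLE with additively combined criteria more concrete, here is a minimal NumPy sketch. It is not BigARTM's API, and every name in it (`artm_em`, `smooth_sparse`, `decorrelate`, `normalize_columns`) is hypothetical. It assumes the EM-style update commonly given in the ARTM literature, in which each regularizer R contributes the term φ·∂R/∂φ (and θ·∂R/∂θ) to the expected counts before clipping at zero and normalizing. The two example regularizers, smoothing/sparsing and topic decorrelation, are chosen only for illustration.

```python
import numpy as np


def normalize_columns(x):
    """Clip negatives to zero and scale each column to sum to 1.

    Columns that are entirely zero after clipping are left as zeros.
    """
    x = np.maximum(x, 0.0)
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)


def smooth_sparse(beta, alpha):
    """Regularizer R = beta * sum(ln phi) + alpha * sum(ln theta).

    Positive coefficients smooth the distributions; negative ones sparsify.
    """
    def reg(phi, theta):
        # phi * dR/dphi = beta and theta * dR/dtheta = alpha, elementwise
        return np.full_like(phi, beta), np.full_like(theta, alpha)
    return reg


def decorrelate(tau):
    """Regularizer R = -(tau / 2) * sum over topic pairs t != s of <phi_t, phi_s>."""
    def reg(phi, theta):
        # For each word and topic t: sum of phi over all other topics s != t
        others = phi.sum(axis=1, keepdims=True) - phi
        return -tau * phi * others, np.zeros_like(theta)
    return reg


def artm_em(n_dw, num_topics, regularizers=(), num_iters=50, seed=0):
    """Fit phi (words x topics, p(w|t)) and theta (topics x docs, p(t|d)).

    n_dw is a documents x words matrix of term counts.
    """
    rng = np.random.default_rng(seed)
    n_wd = np.asarray(n_dw, dtype=float).T
    num_words, num_docs = n_wd.shape
    phi = normalize_columns(rng.random((num_words, num_topics)))
    theta = normalize_columns(rng.random((num_topics, num_docs)))
    for _ in range(num_iters):
        # E-step folded into expected counts: p(t|d,w) is proportional to
        # phi[w, t] * theta[t, d], normalized by the mixture p(w|d).
        p_wd = phi @ theta
        ratio = n_wd / np.maximum(p_wd, 1e-12)
        n_wt = phi * (ratio @ theta.T)
        n_td = theta * (phi.T @ ratio)
        # "Additive" part: contributions of all regularizers are summed.
        r_phi = np.zeros_like(phi)
        r_theta = np.zeros_like(theta)
        for reg in regularizers:
            d_phi, d_theta = reg(phi, theta)
            r_phi += d_phi
            r_theta += d_theta
        # M-step: regularized counts, clipped at zero and normalized.
        phi = normalize_columns(n_wt + r_phi)
        theta = normalize_columns(n_td + r_theta)
    return phi, theta
```

In this sketch, combining requirements amounts to passing a longer list, for example `artm_em(counts, 10, [smooth_sparse(0.1, 0.1), decorrelate(0.5)])`, which is the modular composition the abstract describes. With an empty list the loop reduces to a plain maximum-likelihood EM iteration for the unregularized model.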