
Assessing the Uncertainty of the Text Generating Process Using Topic Models


Abstract

Latent Dirichlet Allocation (LDA) is one of the most popular topic models employed for the analysis of large text data. When applied repeatedly to the same text corpus, LDA leads to different results. To address this issue, several methods have been proposed. In this paper, instead of dealing with this methodological source of algorithmic uncertainty, we assess the aleatoric uncertainty of the text generating process itself. For this task, we use a direct LDA-model approach to quantify the uncertainty due to the random process of text generation and propose three different bootstrap approaches to resample texts. These allow us to construct uncertainty intervals of topic proportions for single texts as well as for text corpora over time. We discuss how the uncertainty intervals derived from the three bootstrap approaches and from the direct approach differ, both for single texts and for aggregations of texts. We present the results of applying the proposed methods to an example corpus consisting of all articles published in a German quality daily newspaper over one full year, and we investigate the effect of different sample sizes on the uncertainty intervals.
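As a rough illustration of the general idea of bootstrapping topic proportions for a single text, the sketch below resamples the tokens of one text with replacement and reads off percentile intervals. The abstract does not specify the three bootstrap schemes, so this word-level resampling is an assumption and not the authors' method; the function name `bootstrap_topic_intervals` and the input format (one topic index per token, for example from a fitted LDA model) are hypothetical.

```python
import random
from collections import Counter


def bootstrap_topic_intervals(assignments, n_topics, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for the topic proportions of one text.

    assignments: list of topic indices, one per token (hypothetical input,
    assumed to come from token-level topic assignments of a fitted model).
    Returns a list of (lower, upper) tuples, one per topic.
    """
    rng = random.Random(seed)
    n = len(assignments)
    samples = [[] for _ in range(n_topics)]

    for _ in range(n_boot):
        # Resample the tokens of the text with replacement.
        resample = rng.choices(assignments, k=n)
        counts = Counter(resample)
        for k in range(n_topics):
            samples[k].append(counts.get(k, 0) / n)

    intervals = []
    for k in range(n_topics):
        ordered = sorted(samples[k])
        # Simple percentile interval from the sorted bootstrap proportions.
        lower = ordered[int((alpha / 2) * (n_boot - 1))]
        upper = ordered[int((1 - alpha / 2) * (n_boot - 1))]
        intervals.append((lower, upper))
    return intervals


# Toy text with 100 tokens: 60 in topic 0, 30 in topic 1, 10 in topic 2.
toy_assignments = [0] * 60 + [1] * 30 + [2] * 10
for topic, (lower, upper) in enumerate(bootstrap_topic_intervals(toy_assignments, 3)):
    print(f"topic {topic}: [{lower:.2f}, {upper:.2f}]")
```

Shorter texts give wider intervals under this scheme, which is one simple way to see why sample size matters for the uncertainty intervals.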
