Published in: Knowledge and Information Systems

Discovering topic structures of a temporally evolving document corpus



Abstract

In this paper we describe a novel framework for discovering the topical content of a data corpus and for tracking its complex structural changes across the temporal dimension. In contrast to previous work, our model does not impose a prior on the rate at which documents are added to the corpus, nor does it adopt the Markovian assumption, which overly restricts the type of changes that the model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, splitting, and merging. The power of the proposed framework is demonstrated on two medical literature corpora concerned with autism spectrum disorder (ASD) and metabolic syndrome (MetS), both increasingly important research subjects with significant social and healthcare consequences. In addition to the collected ASD and MetS literature corpora, which we have made freely available, our contribution also includes an extensive empirical analysis of the proposed framework. We describe a detailed and careful examination of the effects that our algorithm's free parameters have on its output, and discuss the significance of the findings both in the context of the practical application of our algorithm and in the context of the existing body of work on temporal topic analysis. Our quantitative analysis is followed by several qualitative case studies highly relevant to current research on ASD and MetS, on which our algorithm is shown to capture well the actual developments in these fields.
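The three components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the threshold, or how epochs are laid out, so the fixed-width epochs, the Bhattacharyya coefficient, the adjacent-epoch linking, and every function name below are assumptions made for illustration. The hierarchical Dirichlet process step is not shown; the sketch assumes each epoch's topics have already been inferred and are given as word distributions.

```python
import math
from collections import defaultdict


def assign_epochs(timestamps, epoch_length):
    """Map each timestamp to an epoch index (assumed fixed-width, non-overlapping bins)."""
    start = min(timestamps)
    return [int((t - start) // epoch_length) for t in timestamps]


def bhattacharyya(p, q):
    """Similarity of two topics given as {word: probability} dicts (assumed measure)."""
    return sum(math.sqrt(p[w] * q[w]) for w in p.keys() & q.keys())


def build_graph(epoch_topics, threshold):
    """Link topics in adjacent epochs whose similarity meets the threshold.

    epoch_topics: one list per epoch, each holding {word: probability} dicts.
    Returns edges as ((epoch, i), (epoch + 1, j), similarity).
    Linking only adjacent epochs is a simplification made for this sketch.
    """
    edges = []
    for e in range(len(epoch_topics) - 1):
        for i, p in enumerate(epoch_topics[e]):
            for j, q in enumerate(epoch_topics[e + 1]):
                s = bhattacharyya(p, q)
                if s >= threshold:
                    edges.append(((e, i), (e + 1, j), s))
    return edges


def classify(epoch_topics, edges):
    """Label every topic node by its in-degree and out-degree in the graph."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst, _ in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    last = len(epoch_topics) - 1
    labels = {}
    for e, topics in enumerate(epoch_topics):
        for i in range(len(topics)):
            node, tags = (e, i), []
            if e > 0 and in_deg[node] == 0:
                tags.append("emergence")      # no predecessor in the previous epoch
            if e < last and out_deg[node] == 0:
                tags.append("disappearance")  # no successor in the next epoch
            if out_deg[node] > 1:
                tags.append("splitting")      # several successors
            if in_deg[node] > 1:
                tags.append("merging")        # several predecessors
            labels[node] = tags or ["evolution"]
    return labels


# Toy input: three epochs with hand-written topics (not real data).
toy_topics = [
    [{"gene": 0.5, "diet": 0.5}],
    [{"gene": 0.9, "risk": 0.1}, {"diet": 0.9, "risk": 0.1}],
    [{"gene": 0.9, "risk": 0.1}, {"sleep": 1.0}],
]
toy_edges = build_graph(toy_topics, threshold=0.5)
for node, tags in sorted(classify(toy_topics, toy_edges).items()):
    print(node, tags)
```

In this toy input the single first-epoch topic links to both second-epoch topics and is therefore labelled as splitting, while the "sleep" topic in the last epoch has no predecessor and is labelled as emergence. The threshold is the kind of free parameter whose effect on the output the abstract says the authors examine empirically.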
