
A Statistical Model for Topically Segmented Documents



Abstract

Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over terms. Such models have traditionally treated a document as an indivisible unit of the generative process, which may not be appropriate for documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents into smaller, topically cohesive text blocks (segments). A new variable is introduced to model the within-document segments: with this variable at the document level, word generation depends not only on the topics but also on the segments, while the latent topic variable is associated directly with the segments rather than with the document as a whole. Experimental results show that, compared to existing generative models, the proposed model yields better language-modeling perplexity and better support for effective document clustering.
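The abstract does not give the model's parameterization, so the following is only a minimal sketch of the dependency structure it describes: a segment is chosen within the document, a topic is drawn conditioned on the segment, and a word is drawn conditioned on both the topic and the segment. All names (`draw`, `sample_document`, the toy segments, topics, and probabilities) are hypothetical illustrations, not the authors' notation or estimates.

```python
import random


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_document(segment_probs, topic_given_segment,
                    word_given_topic_segment, n_words, rng):
    """Sample (segment, topic, word) triples for one document.

    Assumed factorization (an illustration, not taken from the paper):
        p(s | d) * p(z | s) * p(w | z, s)
    """
    tokens = []
    for _ in range(n_words):
        s = draw(segment_probs, rng)                     # segment within the document
        z = draw(topic_given_segment[s], rng)            # topic tied to the segment
        w = draw(word_given_topic_segment[(z, s)], rng)  # word depends on topic and segment
        tokens.append((s, z, w))
    return tokens


# Toy parameters for a single two-segment document (hypothetical values).
segment_probs = {"seg1": 0.5, "seg2": 0.5}
topic_given_segment = {
    "seg1": {"topicA": 0.9, "topicB": 0.1},
    "seg2": {"topicA": 0.2, "topicB": 0.8},
}
word_given_topic_segment = {
    ("topicA", "seg1"): {"model": 0.7, "topic": 0.3},
    ("topicA", "seg2"): {"model": 0.4, "topic": 0.6},
    ("topicB", "seg1"): {"cluster": 0.5, "segment": 0.5},
    ("topicB", "seg2"): {"cluster": 0.8, "segment": 0.2},
}

tokens = sample_document(segment_probs, topic_given_segment,
                         word_given_topic_segment, n_words=8,
                         rng=random.Random(0))
```

In a document-level mixture model, by contrast, the topic would be drawn from a single per-document distribution and the word from the topic alone; the sketch differs only in routing both draws through the segment.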

Bibliographic details

  • Source
    Discovery science | 2011 | pp. 247-261 | 15 pages
  • Conference location: Espoo (FI)
  • Author affiliations

    ENEA - Portici Research Center, Italy;

    Department of Electronics, Computer and Systems Sciences, University of Calabria, Italy;

    Department of Computer Science Engineering, Digital Technology Center, University of Minnesota, Minneapolis, USA;

  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: Artificial intelligence theory