Conference: International Conference on Theory and Practice of Digital Libraries

Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora

Abstract

Topic modeling has become a popular means of identifying and describing the topical structure of textual documents and whole corpora. However, many document collections, such as qualitative studies in the digital humanities, cannot easily benefit from this technology: their limited size leads to poor-quality topic models. Higher-quality topic models can be learned by incorporating additional domain-specific documents with similar topical content, but finding or even manually composing such corpora requires considerable effort. To solve this problem, we propose topic cropping, a fully automated and adaptable process. To learn topics, this process automatically tailors a domain-specific cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. An evaluation on a real-world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. Specifically, we analyzed the learned topics with respect to coherence, diversity, and relevance.
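
The abstract does not say how the cropping corpus is selected from the general corpus. As a rough illustration of the idea only, the sketch below assumes one plausible selection rule: rank general-corpus documents by TF-IDF cosine similarity to the working corpus and keep the top k. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `crop_corpus`), the similarity measure, and the toy documents are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency, always positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def crop_corpus(working_docs, general_docs, k):
    """Return indices of the k general documents closest to the working corpus.

    Assumption: closeness is TF-IDF cosine similarity to the summed
    working-corpus vector. The paper may use a different criterion.
    """
    token_lists = [tokenize(d) for d in working_docs + general_docs]
    vectors = tfidf_vectors(token_lists)
    working_vecs = vectors[:len(working_docs)]
    general_vecs = vectors[len(working_docs):]

    # Sum the working-corpus vectors into a single profile vector.
    profile = Counter()
    for vec in working_vecs:
        for term, weight in vec.items():
            profile[term] += weight

    scores = [cosine(profile, vec) for vec in general_vecs]
    ranked = sorted(range(len(general_docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data, invented for illustration only.
working = [
    "Interview transcripts about library users searching digital archives",
    "Users describe how they search archive catalogues in the library",
]
general = [
    "A digital library is a collection of archives that users can search online",
    "Football is a team sport played with a ball",
    "The archive catalogue helps library users find documents",
    "Volcanoes erupt when magma rises to the surface",
]
selected = crop_corpus(working, general, k=2)
```

In a full pipeline along the lines of the abstract, a topic model would then be trained on the selected documents and applied to the working corpus via topic inference; those steps are omitted here because they depend on the topic-modeling library chosen.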
