Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora

Abstract

Topic modeling has become popular as a means of identifying and describing the topical structure of textual documents and whole corpora. However, many document collections, such as qualitative studies in the digital humanities, cannot easily benefit from this technology, because the limited size of those corpora leads to poor-quality topic models. Higher-quality topic models can be learned by incorporating additional domain-specific documents with similar topical content, but this requires finding or even manually composing such corpora, which takes considerable effort. To solve this problem, we propose a fully automated, adaptable process of topic cropping. To learn topics, this process automatically tailors a domain-specific cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. An evaluation with a real-world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. Specifically, we analyzed the learned topics with respect to coherence, diversity, and relevance.
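The abstract does not say how the domain-specific corpus is selected from the general corpus. As a minimal sketch of one possible selection step, the Python below picks key terms from the working corpus and keeps the general-corpus documents that match the most of them. The function names (`tokenize`, `key_terms`, `crop_corpus`), the TF-IDF-style scoring, and the toy documents are all assumptions made for illustration, not the authors' method. Topic learning on the selected documents and topic inference on the working corpus would follow, using any topic model implementation; they are not shown here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, split on non-letters, and drop tokens shorter than 3 characters."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 2]


def key_terms(working_docs, general_docs, n_terms=5):
    """Hypothetical heuristic: rank working-corpus terms by term frequency
    times a smoothed inverse document frequency in the general corpus.
    Terms that never occur in the general corpus are skipped, because they
    cannot retrieve any document from it."""
    tf = Counter(t for doc in working_docs for t in tokenize(doc))
    df = Counter(t for doc in general_docs for t in set(tokenize(doc)))
    n = len(general_docs)
    scores = {
        term: count * math.log((n + 1) / (df[term] + 1))
        for term, count in tf.items()
        if df[term] > 0
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]


def crop_corpus(working_docs, general_docs, n_terms=5, top_k=2):
    """Keep the top_k general-corpus documents that share the most key terms
    with the working corpus (an assumed stand-in for the cropping step)."""
    terms = set(key_terms(working_docs, general_docs, n_terms))
    scored = []
    for index, doc in enumerate(general_docs):
        overlap = len(terms & set(tokenize(doc)))
        if overlap > 0:
            scored.append((overlap, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [general_docs[index] for _, index in scored[:top_k]]


# Toy data, invented for this example only.
working = [
    "Interview about migration and family memories of migration",
    "Oral history interview on labor migration",
]
general = [
    "Human migration is the movement of people from one place to another",
    "The guitar is a string instrument",
    "Oral history is the collection of interview recordings about past events",
    "Photosynthesis converts light into chemical energy in plants",
]

for doc in crop_corpus(working, general):
    print(doc)
```

In this sketch the small working corpus only steers which general-corpus documents are kept; the larger selected set is what a topic model would then be trained on.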

Bibliographic details

Source
International Conference on Theory and Practice of Digital Libraries | 2013 | 12 pages
Authors
Nam Khanh Tran; Sergej Zerr; Kerstin Bischoff; Claudia Niederee; Ralf Krestel
Original format: PDF
Chinese Library Classification: G250.76-53
Keywords
Digital humanities; Qualitative data; Topic modeling

