A generalized topic modeling approach for automatic document annotation

Suppawong Tuarob; Line C. Pouchard; Prasenjit Mitra; C. Lee Giles

Abstract

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify hypotheses. Over time, such data have increased not only in amount but also in the diversity and heterogeneity of their sources, which are spread throughout the world. This heterogeneity poses a major challenge for scientists who have to search manually for the data they need. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotations from well-annotated collections of metadata records in order to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
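To make the tag recommendation framing concrete, here is a minimal sketch in which the controlled tag library is simply the set of tags found on well-annotated records, and a poorly annotated record receives the tags of the records it most resembles. The abstract does not specify the two algorithm variants, so the bag-of-words cosine similarity below is an assumed stand-in for whatever document representation the paper uses (for example, topic distributions). The function names and the toy records are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it on any non-alphanumeric character.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_tags(query_text, annotated_records, k=3):
    # Each annotated record votes for its own tags, weighted by its
    # similarity to the query record. Assumption: this voting scheme is
    # illustrative only and is not taken from the paper.
    query_vec = Counter(tokenize(query_text))
    scores = Counter()
    for text, tags in annotated_records:
        sim = cosine(query_vec, Counter(tokenize(text)))
        for tag in tags:
            scores[tag] += sim
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, score in ranked[:k] if score > 0]


# Hypothetical well-annotated metadata records: (description, tags).
records = [
    ("soil carbon flux measurements in temperate forest plots", ["soil", "carbon"]),
    ("stream water temperature and discharge observations", ["hydrology", "temperature"]),
    ("forest canopy carbon uptake from eddy covariance towers", ["carbon", "forest"]),
]

print(recommend_tags("carbon flux in forest soil", records, k=2))
```

Swapping `cosine` over term counts for a similarity over per-document topic distributions would bring the sketch closer to a topic-model-based recommender while leaving the voting step unchanged.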

Bibliographic details

Source
International Journal on Digital Libraries, 2015, Issue 2, pp. 111-128 (18 pages)
Authors
Suppawong Tuarob; Line C. Pouchard; Prasenjit Mitra; C. Lee Giles
Author affiliations

Computer Science and Engineering, The Pennsylvania State University (1);

Purdue University (3);

Qatar Computing Research Institute (4);

Computer Science and Engineering, The Pennsylvania State University (1);

Information Science and Technology, The Pennsylvania State University (2)

Original format: PDF
Language: English
Keywords
Metadata annotation; Topic model; Tag recommendation
