International Journal on Digital Libraries

A generalized topic modeling approach for automatic document annotation

Abstract

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify hypotheses. Over time, such data have grown not only in volume but also in the diversity and heterogeneity of their sources, which are spread throughout the world. This heterogeneity poses a major challenge for scientists who must manually search for the data they need. ONEMercury has recently been implemented as part of the DataONE project to alleviate this problem and to serve as a portal for accessing environmental and observational data from across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotation from well-annotated collections of metadata records in order to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
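To make the problem framing concrete, the following is a minimal sketch of tag recommendation with a controlled tag library: tags are learned from well-annotated records and suggested for an unannotated one. It is not the authors' algorithm. The abstract does not give the details of its two variants, so this sketch swaps in a simple TF-IDF similarity vote in place of topic-model-based scoring. All function names (`recommend_tags`, `build_idf`, and so on) and the toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_idf(texts):
    """Smoothed inverse document frequency computed over the annotated records."""
    n = len(texts)
    df = Counter()
    for t in texts:
        df.update(set(tokenize(t)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector; words unseen in the annotated records are ignored."""
    tf = Counter(w for w in tokenize(text) if w in idf)
    return {w: c * idf[w] for w, c in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend_tags(query, annotated, tag_library, k=3):
    """Rank tags from the controlled library by similarity-weighted votes.

    `annotated` is a list of (text, tags) pairs from a well-annotated
    collection; only tags present in `tag_library` may be recommended.
    """
    idf = build_idf([text for text, _ in annotated])
    q = vectorize(query, idf)
    scores = defaultdict(float)
    for text, tags in annotated:
        sim = cosine(q, vectorize(text, idf))
        for tag in tags:
            if tag in tag_library:
                scores[tag] += sim
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, score in ranked[:k] if score > 0]


# Toy, made-up metadata records for illustration only.
annotated = [
    ("soil carbon flux measured in temperate forest plots", {"soil", "carbon"}),
    ("stream water temperature and dissolved oxygen records", {"water", "temperature"}),
    ("forest canopy carbon uptake from eddy covariance towers", {"carbon", "forest"}),
]
tag_library = {"soil", "carbon", "water", "temperature", "forest"}

print(recommend_tags("carbon exchange in a forest ecosystem", annotated, tag_library))
```

A topic-model variant would presumably replace the TF-IDF vectors with per-document topic distributions while keeping the same overall shape: score candidate tags from the controlled library against the unannotated record, then return the top-ranked ones.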