Use of Descriptive Metadata as a Knowledgebase for Analyzing Data in Large Textual Collections


Abstract

Descriptive metadata, such as an article's title, authors, institutional affiliations, keywords and date of publication, collected either manually or automatically from document contents, is often used to search and retrieve relevant documents in an archived collection. This metadata, especially for a large text corpus such as a biomedical collection, may encapsulate patterns, trends, and other valuable information, which is usually revealed by using specialized data analysis software to answer specific questions. A more useful, generalized approach is to repurpose this metadata to serve as a knowledgebase that answers appropriate semantic queries. At the US National Library of Medicine (NLM), we recently archived a large biomedical collection comprising annual conference proceedings that report research on cholera conducted between 1960 and 2011 under the "US-Japan Cooperative Medical Science Program" (CMSP). This program was established to address health problems in Southeast Asia and other developing countries. An R&D information management system developed at NLM, called "System for the Preservation of Electronic Resources" (SPER), automatically extracted descriptive metadata from this text corpus and built a DSpace-based archive for accessing the conference articles. Using special-purpose data analysis software, SPER also used this metadata to obtain detailed information about the CMSP research community, the timelines of important drugs and discoveries, and international collaboration.

In this paper, we describe the occurrence and extraction of metadata from the CMSP document set, and present an alternative approach in which this metadata is used to build a knowledgebase to support semantic queries about the CMSP. Specifically, we show the OWL-based hierarchical ontology model created to represent the CMSP with its publications, participants and international collaboration over time. We discuss the technique used to convert the extracted metadata from relational database tables to OWL/RDF assertions suitable for supporting semantic queries. We show examples of queries performed against this CMSP knowledgebase, and discuss some scalability issues. Finally, we describe how this approach could be customized for other large textual collections, including one from the Food and Drug Administration previously archived by the SPER system.
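The conversion from relational tables to RDF-style assertions, followed by a semantic query, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not give the paper's table layout, ontology IRIs, or query language, so the `article` and `authorship` tables, the `http://example.org/cmsp#` namespace, the property names, and the placeholder rows are all hypothetical. Plain tuples and a small pattern matcher stand in for a real triple store and SPARQL engine.

```python
import sqlite3

# Hypothetical namespace and vocabulary; not taken from the paper.
NS = "http://example.org/cmsp#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_demo_db():
    """Create an in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE authorship (article_id INTEGER, author TEXT, country TEXT);
        INSERT INTO article VALUES (1, 'Placeholder title A', 1975);
        INSERT INTO article VALUES (2, 'Placeholder title B', 1990);
        INSERT INTO authorship VALUES (1, 'Author One', 'US');
        INSERT INTO authorship VALUES (1, 'Author Two', 'Japan');
        INSERT INTO authorship VALUES (2, 'Author One', 'US');
        """
    )
    return conn


def rows_to_triples(conn):
    """Turn each table row into (subject, predicate, object) assertions."""
    triples = set()
    for art_id, title, year in conn.execute("SELECT id, title, year FROM article"):
        subj = f"{NS}article{art_id}"
        triples.add((subj, RDF_TYPE, f"{NS}Article"))
        triples.add((subj, f"{NS}hasTitle", title))
        triples.add((subj, f"{NS}publishedInYear", year))
    query = "SELECT article_id, author, country FROM authorship"
    for art_id, author, country in conn.execute(query):
        person = NS + "person_" + author.replace(" ", "_")
        triples.add((person, RDF_TYPE, f"{NS}Participant"))
        triples.add((person, f"{NS}fromCountry", country))
        triples.add((f"{NS}article{art_id}", f"{NS}hasAuthor", person))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def articles_with_international_coauthors(triples):
    """Articles whose authors come from more than one country."""
    result = []
    for art, _, _ in match(triples, p=RDF_TYPE, o=f"{NS}Article"):
        countries = {
            country
            for _, _, person in match(triples, s=art, p=f"{NS}hasAuthor")
            for _, _, country in match(triples, s=person, p=f"{NS}fromCountry")
        }
        if len(countries) > 1:
            result.append(art)
    return sorted(result)


if __name__ == "__main__":
    kb = rows_to_triples(build_demo_db())
    print(articles_with_international_coauthors(kb))
```

The query joins two assertion patterns (article to author, author to country), which is the kind of question about international collaboration that the abstract mentions. A production system would more plausibly load the assertions into an RDF store and express the same join in a standard query language.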

Bibliographic record

  • Source
    Archiving 2013 | 2013 | pp. 193-199 | 7 pages
  • Conference location Washington DC (US)
  • Author affiliations

    National Library of Medicine, Bethesda, Maryland, USA;

    National Library of Medicine, Bethesda, Maryland, USA;

  • Original format PDF
  • Text language eng
  • Date added to database 2022-08-26 14:07:44
