NCBI Disease Corpus: A Resource for Disease Name Recognition and Concept Normalization

Abstract

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. However, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases depends on the availability of annotated corpora.

This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed pre-annotations to be used as a starting point for manual annotation. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus over two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency.

The public release of the NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. To help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment in which we compared three knowledge-based disease normalization methods, the best of which achieved an F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard that enables the development of machine-learning-based approaches for such tasks.
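
For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how such a score might be computed for concept normalization. It assumes that predictions and gold annotations are compared as sets of concept identifiers per abstract and that counts are micro-averaged over the corpus. The abstract does not state the exact evaluation protocol, so both choices are assumptions, and the function name `micro_prf` and the identifiers `doc1`, `C1`, and so on are made up for illustration.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]],
    pred: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over per-document concept sets.

    Assumption: evaluation is done at the abstract level on sets of concept IDs.
    This is an illustrative choice, not necessarily the paper's protocol.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # concepts both annotated and predicted
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but not predicted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: two abstracts with placeholder concept IDs.
gold = {"doc1": {"C1", "C2"}, "doc2": {"C3"}}
pred = {"doc1": {"C1"}, "doc2": {"C3", "C4"}}
precision, recall, f_measure = micro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

Micro-averaging pools true positives, false positives and false negatives across all abstracts before computing the ratios, so abstracts with many disease concepts weigh more than those with few. A macro-averaged variant would instead compute the score per abstract and average the results.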
