Journal of Cheminformatics

The CHEMDNER corpus of chemicals and drugs and its annotation principles

Abstract

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each chemical entity mention was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts), we provide not only the Gold Standard manual annotations but also the mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for the minimum information required about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: