The CHEMDNER corpus of chemicals and drugs and its annotation principles


Abstract

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at:
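As an illustration of the annotation scheme and the agreement study described in the abstract, the following is a minimal Python sketch. It is not the authors' code or the corpus file format: the `Mention` record, its field names, the lowercase class labels, and the `percentage_agreement` function are all hypothetical, and the agreement definition used here (mentions with identical document and character offsets, divided by all distinct mentions from both annotators) is an assumption, since the abstract does not state how the 91% figure was computed.

```python
from dataclasses import dataclass

# The seven SACEM classes listed in the abstract (label spelling is an assumption).
SACEM_CLASSES = frozenset({
    "abbreviation", "family", "formula", "identifier",
    "multiple", "systematic", "trivial",
})


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one chemical entity mention."""
    doc_id: str       # e.g. a PubMed identifier
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    sacem_class: str  # one of SACEM_CLASSES

    def __post_init__(self):
        if self.sacem_class not in SACEM_CLASSES:
            raise ValueError(f"unknown SACEM class: {self.sacem_class!r}")
        if not 0 <= self.start < self.end:
            raise ValueError("offsets must satisfy 0 <= start < end")


def percentage_agreement(annotator_a, annotator_b):
    """Assumed definition: exact-span matches over all distinct spans, times 100."""
    spans_a = {(m.doc_id, m.start, m.end) for m in annotator_a}
    spans_b = {(m.doc_id, m.start, m.end) for m in annotator_b}
    all_spans = spans_a | spans_b
    if not all_spans:
        return 100.0  # nothing annotated by either side: trivially in agreement
    return 100.0 * len(spans_a & spans_b) / len(all_spans)


# Toy data, invented for illustration only.
a = [
    Mention("doc1", 0, 7, "trivial"),
    Mention("doc1", 20, 24, "formula"),
    Mention("doc2", 5, 9, "abbreviation"),
]
b = [
    Mention("doc1", 0, 7, "trivial"),
    Mention("doc1", 20, 24, "formula"),
    Mention("doc2", 5, 12, "systematic"),  # different span boundary
]
print(percentage_agreement(a, b))  # 2 shared spans out of 4 distinct spans
```

Matching on spans alone ignores the SACEM class; a stricter variant could add `sacem_class` to the compared tuple so that a class disagreement also counts as a mismatch.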


Bibliographic details

Journal: Journal of Cheminformatics
Authors: Martin Krallinger; Obdulia Rabal; Florian Leitner; Miguel Vazquez; David Salgado; Zhiyong Lu; Robert Leaman; Yanan Lu; Donghong Ji; Daniel M Lowe; Roger A Sayle; Riza Theresa Batista-Navarro; Rafal Rak; Torsten Huber; Tim Rocktäschel; Sérgio Matos; David Campos; Buzhou Tang; Hua Xu; Tsendsuren Munkhdalai; Keun Ho Ryu; SV Ramanan; Senthil Nathan; Slavko Žitnik; Marko Bajec; Lutz Weber; Matthias Irmer; Saber A Akhondi; Jan A Kors; Shuo Xu; Xin An; Utpal Kumar Sikdar; Asif Ekbal; Masaharu Yoshioka; Thaer M Dieb; Miji Choi; Karin Verspoor; Madian Khabsa; C Lee Giles; Hongfang Liu; Komandur Elayavilli Ravikumar; Andre Lamurias; Francisco M Couto; Hong-Jie Dai; Richard Tzong-Han Tsai; Caglar Ata; Tolga Can; Anabel Usié; Rui Alves; Isabel Segura-Bedmar; Paloma Martínez; Julen Oyarzabal; Alfonso Valencia
Year (volume), issue: 2015 (7), Suppl 1
Year: 2015
Pages: S2
Total pages: 17
Original format: PDF
Chinese Library Classification: biochemical genetics; biochemical pharmacology
Keywords: named entity recognition; BioCreative; text mining; chemical entity recognition; machine learning; chemical indexing; ChemNLP

