NCBI Disease Corpus: A Resource for Disease Name Recognition and Concept Normalization

Abstract

Information encoded in natural language in biomedical literature is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. However, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora.

This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotation. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency.

The public release of the NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. To help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment in which we compared three different knowledge-based disease normalization methods; the best performance was an F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine-learning based approaches for such tasks.
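For readers unfamiliar with the F-measure quoted above, the following is a minimal sketch of how precision, recall and F-measure could be computed for a normalization method. It assumes that evaluation compares, per abstract, the set of predicted concept identifiers with the set of gold-standard identifiers and pools the counts over all abstracts; the abstract does not state the exact protocol, so this is an illustration rather than the authors' procedure. The function name `micro_prf` and the identifiers `C1` to `C4` are hypothetical.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]],
    pred: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) pooled over all documents.

    gold and pred map a document id to a set of concept identifiers.
    Assumption: a prediction counts as correct only on an exact
    identifier match within the same document.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # identifiers found in both sets
        fp += len(p - g)   # predicted but not in the gold standard
        fn += len(g - p)   # in the gold standard but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up toy data, not taken from the corpus.
gold = {"doc1": {"C1", "C2"}, "doc2": {"C3"}}
pred = {"doc1": {"C1"}, "doc2": {"C3", "C4"}}
precision, recall, f_measure = micro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

In the toy data there are two correct identifiers, one spurious prediction and one missed concept, so precision, recall and F-measure all come out at 2/3.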

Bibliographic details

Journal: Journal of Biomedical Informatics
Authors: Rezarta Islamaj Doğan; Robert Leaman; Zhiyong Lu
Year, volume: 2014, 47
Pages: 1–10
Total pages: 28
Original format: PDF
Keywords: Disease name recognition; Named entity recognition; Disease name normalization; Corpus annotation; Disease name corpus
