Journal: Briefings in Bioinformatics

Biomedical named entity recognition and linking datasets: survey and our recent development


Abstract

Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein-protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonly used BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein-protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge.
