Conference on Empirical Methods in Natural Language Processing

Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets

Abstract

Extracting typed entity mentions from text is a fundamental component of language understanding and reasoning. While there exist substantial labeled text datasets for multiple subsets of biomedical entity types, such as genes and proteins, or chemicals and diseases, it is rare to find large labeled datasets containing labels for all desired entity types together. This paper presents a method for training a single CRF extractor from multiple datasets with disjoint or partially overlapping sets of entity types. Our approach employs marginal likelihood training to insist on labels that are present in the data, while filling in "missing labels". This allows us to leverage all the available data within a single model. In experimental results on the BioCreative V CDR (chemicals/diseases), BioCreative VI ChemProt (chemicals/proteins) and MedMentions (19 entity types) datasets, we show that joint training on multiple datasets improves NER F1 over training in isolation, and our methods achieve state-of-the-art results.
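
The sketch below is a minimal, hedged illustration of the idea in the abstract: for each token, the observed annotation constrains the set of allowed tags, and the loss marginalizes (via the forward algorithm) over every tag sequence consistent with those constraints. The function names log_forward and marginal_nll and the allowed_mask representation are illustrative assumptions, not the paper's implementation, and a bare linear-chain CRF is assumed in place of the paper's full BiLSTM-CRF.

# A minimal sketch of a marginal-likelihood CRF loss (illustrative only;
# not the authors' implementation).
import torch

def log_forward(emissions, transitions, allowed_mask):
    # Forward algorithm restricted to the tags permitted at each position.
    # emissions:    (seq_len, num_tags) per-token tag scores (e.g. from a BiLSTM)
    # transitions:  (num_tags, num_tags) CRF transition scores, [i, j] = i -> j
    # allowed_mask: (seq_len, num_tags) bool, True where a tag is permitted
    # Returns the log-sum-exp of scores of all tag sequences obeying the mask.
    neg_inf = torch.tensor(float("-inf"))
    alpha = torch.where(allowed_mask[0], emissions[0], neg_inf)
    for t in range(1, emissions.size(0)):
        scores = alpha.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        alpha = torch.where(allowed_mask[t],
                            torch.logsumexp(scores, dim=0), neg_inf)
    return torch.logsumexp(alpha, dim=0)

def marginal_nll(emissions, transitions, allowed_mask):
    # Negative marginal log-likelihood: the numerator sums over every tag
    # sequence consistent with the observed (partial) labels, while the
    # denominator is the ordinary partition function over all sequences.
    all_allowed = torch.ones_like(allowed_mask, dtype=torch.bool)
    log_numerator = log_forward(emissions, transitions, allowed_mask)
    log_partition = log_forward(emissions, transitions, all_allowed)
    return log_partition - log_numerator

Under this sketch, a token whose gold tag is observed keeps only that tag in allowed_mask, while a token from a dataset that does not annotate a given entity type allows O as well as every tag of the uncovered types, so the missing labels are marginalized out rather than forced to O.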