Journal: BMC Bioinformatics

Improving the recall of biomedical named entity recognition with label re-correction and knowledge distillation


Abstract

Biomedical named entity recognition (BioNER) is one of the most essential tasks in biomedical information extraction. Previous studies have been limited by inadequate annotated datasets, and in particular by the limited knowledge those datasets contain. To address this, we propose a novel BioNER framework with label re-correction and knowledge distillation strategies, which can both create large, high-quality datasets and yield a high-performance recognition model. The framework is motivated by two observations: (1) named entity recognition should be considered from the perspectives of both coverage and accuracy; (2) trustworthy annotations should be produced by iterative correction. First, for coverage, we annotate chemical and disease entities in a large-scale unlabeled dataset with PubTator to generate a weakly labeled dataset. For accuracy, we then filter this dataset using multiple knowledge bases to generate a second weakly labeled dataset. Next, both datasets are revised with a label re-correction strategy to construct two high-quality datasets, each of which is used to train its own recognition model. Finally, we compress the knowledge in the two models into a single recognition model through knowledge distillation. Experiments on the BioCreative V chemical-disease relation corpus and the NCBI Disease corpus show that knowledge from large-scale datasets significantly improves BioNER performance, especially recall, leading to new state-of-the-art results. In summary, the proposed framework combines label re-correction and knowledge distillation, and comparison results show that the two perspectives of knowledge in the two re-corrected datasets are complementary and that both are effective for BioNER.
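
The abstract does not state the distillation objective, so the following is only a minimal sketch of one common way to compress two teacher models into a single student. It assumes that each teacher emits per-token logits over the same tag set, that the teachers' temperature-softened distributions are simply averaged, and that the student is trained with cross-entropy against that average. The function names, the three-tag example, and the temperature of 2.0 are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw tag scores into probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_teachers(teacher_logits, temperature=2.0):
    """Average the softened tag distributions of several teachers for one token.

    Assumption: equal weighting of teachers; the paper may weight them differently.
    """
    dists = [softmax(logits, temperature) for logits in teacher_logits]
    n_tags = len(dists[0])
    return [sum(d[i] for d in dists) / len(dists) for i in range(n_tags)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the averaged teacher soft labels and the student."""
    target = average_teachers(teacher_logits, temperature)
    pred = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


# Hypothetical logits for one token over the tags (B, I, O).
coverage_teacher = [2.0, 0.5, 0.1]   # model trained on the coverage-oriented dataset
accuracy_teacher = [1.5, 0.2, 0.3]   # model trained on the knowledge-base-filtered dataset
teachers = [coverage_teacher, accuracy_teacher]

agreeing_student = [2.0, 0.3, 0.2]
disagreeing_student = [0.1, 0.3, 2.0]

print(distillation_loss(agreeing_student, teachers))
print(distillation_loss(disagreeing_student, teachers))
```

A student whose predictions match the teachers' consensus receives a lower loss than one that contradicts it. In practice this term is usually summed over all tokens and often combined with a loss on gold labels.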
