Recognizing named entities in biomedical texts

Abstract

Named entities (NEs) in biomedical text are objects of interest to biomedical researchers, such as proteins and genes. Identifying them accurately is important for biomedical natural language processing (BioNLP). This thesis focuses on biomedical named entity recognition (BioNER) and presents novel results on five topics. First, we study whether corpus-based statistical learning methods, currently dominant in BioNER, could achieve close-to-human performance by training on larger corpora. We find that a significantly larger corpus would be required to achieve performance significantly higher than the state of the art obtained on the GENIA corpus, which suggests that the hypothesis is not warranted. Second, we address the issue of nested NEs and propose a level-by-level method that learns a separate NER model for each level of nesting. We show that this method works well for both nested and non-nested NEs. Third, we propose a method that builds NEs on top of base NP chunks, and examine its benefits as well as its problems. Our experiments show that this method, though inferior to statistical word-based approaches, has the potential to outperform them, provided that domain-specific rules can be designed to determine NE boundaries from NP chunks. Fourth, we present a method for BioNER in the absence of annotated corpora. It uses an NE dictionary to label sentences, and then uses these partially labeled sentences to iteratively train an SVM model in a semi-supervised manner. Our experiments validate the effectiveness of the method. Finally, we explore BioNER in Chinese text, an area that has not been studied in previous work. We train a character-based CRF model on a small set of manually annotated Chinese biomedical abstracts and examine the features usable for the model. Our evaluation suggests that corpus-based statistical learning approaches hold promise for this task. All the proposed methods are novel and are applicable beyond the NE types and languages considered here, and beyond the BioNER task itself.
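
As a rough illustration of the dictionary-labeling step in the fourth topic, the sketch below tags a tokenized sentence by greedy longest match against a small NE dictionary and leaves unmatched tokens unlabeled. The abstract does not say how matching is done, so the greedy longest-match strategy, the BIO tag scheme, the "?" marker for unlabeled tokens, the function name, and the sample dictionary entries are all assumptions made for illustration. The iterative SVM training that follows this step in the thesis is not shown.

```python
from typing import Dict, List


def label_with_dictionary(
    tokens: List[str], dictionary: Dict[str, str], max_len: int = 5
) -> List[str]:
    """Partially label a sentence using a (hypothetical) NE dictionary.

    Matched spans receive B-/I- tags with the dictionary's entity type.
    Tokens that match nothing are tagged "?" (unknown), not "O", because
    a dictionary miss does not prove the token lies outside an entity.
    """
    tags = ["?"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                entity_type = dictionary[phrase]
                tags[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Illustrative entries only; not taken from the thesis.
sample_dictionary = {"il-2 receptor": "protein", "nf-kappa b": "protein"}
sample_tokens = "The IL-2 receptor binds NF-kappa B".split()
sample_tags = label_with_dictionary(sample_tokens, sample_dictionary)
```

In a semi-supervised setup of the kind the abstract describes, the "?" tokens would be the ones a model is asked to fill in and refine over successive training rounds.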

Bibliographic record

  • Author

    Gu Baohua

  • Year 2008
  • Original format PDF
  • Language English
