PAKDD (Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining) 2007 International Workshops; 2007-05-22; Nanjing (CN)

Incorporating Dictionary Features into Conditional Random Fields for Gene/Protein Named Entity Recognition


Abstract

Biomedical Named Entity Recognition (BioNER) is an important preliminary step for biomedical text mining. Previous researchers built dictionaries of gene/protein names from online databases and incorporated them into machine learning models as features, but the gains were very limited. This paper assesses the quality of four dictionaries derived from online resources and investigates the impact of two factors (dictionary coverage and noisy terms) that may explain the poor performance of dictionary features. Experiments compare the external dictionaries with a dictionary derived from the GENETAG corpus, using Conditional Random Fields (CRFs) with dictionary features. We also examine the impact separately for long names and short names. The results show that low coverage of long names and noise among short names are the main problems of current online resources, and that a high-quality dictionary could substantially improve the accuracy of BioNER.
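
To make the idea of a "dictionary feature" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a greedy longest-match lookup over lowercased tokens, a hypothetical `B-DICT`/`I-DICT`/`O` tag scheme, and a simple exact-match notion of coverage; the function names, the `max_len` limit, and the toy dictionary are all illustrative assumptions. The per-token feature dictionaries it produces are the kind of input a CRF toolkit that accepts feature dictionaries could consume.

```python
from typing import Dict, List, Set, Tuple

Name = Tuple[str, ...]  # a dictionary entry stored as a tuple of lowercased tokens


def dictionary_tags(tokens: List[str], dictionary: Set[Name], max_len: int = 5) -> List[str]:
    """Tag tokens B-DICT / I-DICT / O using greedy longest dictionary match."""
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so long names win over short ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B-DICT"
            for j in range(i + 1, i + matched):
                tags[j] = "I-DICT"
            i += matched
        else:
            i += 1
    return tags


def token_features(tokens: List[str], dictionary: Set[Name]) -> List[Dict[str, object]]:
    """Build one feature dict per token; the 'dict' key is the dictionary feature."""
    tags = dictionary_tags(tokens, dictionary)
    return [{"word.lower": tok.lower(), "dict": tag} for tok, tag in zip(tokens, tags)]


def dictionary_coverage(gold_names: List[List[str]], dictionary: Set[Name]) -> float:
    """Fraction of gold entity names that appear verbatim (case-insensitive) in the dictionary."""
    if not gold_names:
        return 0.0
    hits = sum(1 for name in gold_names if tuple(w.lower() for w in name) in dictionary)
    return hits / len(gold_names)


# Toy data, purely illustrative.
toy_dictionary: Set[Name] = {("tumor", "necrosis", "factor"), ("p53",)}
sentence = ["Tumor", "necrosis", "factor", "activates", "p53", "."]
features = token_features(sentence, toy_dictionary)
coverage = dictionary_coverage([["tumor", "necrosis", "factor"], ["IL-2"]], toy_dictionary)
```

In this sketch, a missing long name leaves every one of its tokens tagged `O` (a coverage problem), while a short ambiguous entry can fire `B-DICT` on unrelated tokens (a noise problem). These are the two factors the abstract investigates.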
