Conference: Workshop on Biomedical Natural Language Processing

Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach


Abstract

Retrieving information about highly ambiguous gene/protein homonyms is a challenge, in particular where their non-protein meanings are more frequent than their protein meaning (e.g., SAH or HF). Because such abbreviations have limited coverage in common benchmarking data sets, the performance of existing gene/protein recognition tools on these problematic cases is hard to assess. We uniformly sample a corpus of eight ambiguous gene/protein abbreviations from Medline® and provide manual annotations for each mention of these abbreviations. Based on this resource, we show that available gene recognition tools, such as conditional random fields (CRF) trained on BioCreative 2 NER data or GNAT, tend to underperform on this phenomenon. We propose to extend existing gene recognition approaches by combining a CRF and a support vector machine. In a cross-entity evaluation, and without taking any entity-specific information into account, our model achieves a gain of 6 points in F1-measure over our best baseline, which checks for the occurrence of a long form of the abbreviation, and of more than 9 points over all existing tools investigated.
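The long-form baseline mentioned in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the lookup table, and the single long-form entry for SAH are illustrative and are not taken from the paper or its corpus. A mention is labeled as a gene/protein only if a known long form of the abbreviation occurs in the same text, and predictions are then scored with the F1-measure.

```python
from typing import Mapping, Sequence

# Illustrative assumption: a tiny lookup table of long forms per abbreviation.
# The entry below is a plausible example, not data from the paper.
LONG_FORMS: Mapping[str, Sequence[str]] = {
    "SAH": ["S-adenosylhomocysteine hydrolase"],
}


def long_form_baseline(
    abbreviation: str,
    text: str,
    long_forms: Mapping[str, Sequence[str]] = LONG_FORMS,
) -> bool:
    """Predict the gene/protein sense if any known long form occurs in the text."""
    lowered = text.lower()
    return any(form.lower() in lowered for form in long_forms.get(abbreviation, []))


def f1_score(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1-measure for the positive (gene/protein) class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gene_text = "S-adenosylhomocysteine hydrolase (SAH) activity was reduced."
other_text = "Patients with SAH developed vasospasm after subarachnoid hemorrhage."
predictions = [
    long_form_baseline("SAH", gene_text),
    long_form_baseline("SAH", other_text),
]
```

A baseline of this kind can only fire when the long form is spelled out nearby, so it misses gene mentions that appear as the bare abbreviation. That limitation is consistent with the abstract's report that the combined CRF and support vector machine model improves on it.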
