Journal: BMC Bioinformatics

How to make the most of NE dictionaries in statistical NER

Abstract

Background: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution, even when large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary should lead to better performance; in practice, however, NER models must be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from heavy term variation and ambiguity.

Methods: We have established a novel way to improve NER performance by adding NEs to an NE dictionary without retraining. In our approach, known NEs are first identified in parallel with Part-of-Speech (POS) tagging, based on a general word dictionary and an NE dictionary. Statistical NER is then trained on the POS/PROTEIN tagger outputs with the correct NE labels attached.

Results: We evaluated the performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after protein names appearing in the training data were added to the POS tagger dictionary, without any model retraining. Performance increased further to 78.72 after the tagging dictionary was enriched with test set protein names.

Conclusion: Our approach demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.
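The tagging step described in the Methods can be illustrated with a minimal sketch. It is not the authors' tagger: it assumes a greedy longest-match lookup in place of whatever disambiguation the real POS/PROTEIN tagger performs, and the names `tag_tokens`, `word_pos`, `ne_dict`, and the default tag `NN` are hypothetical. The point it shows is that a downstream statistical NER model consumes the tag sequence as input features, so adding an entry to the NE dictionary changes those features without touching the model's parameters.

```python
from typing import Dict, List, Set, Tuple


def tag_tokens(
    tokens: List[str],
    word_pos: Dict[str, str],
    ne_dict: Set[Tuple[str, ...]],
) -> List[str]:
    """Assign one tag per token: PROTEIN for NE dictionary matches, else a POS tag.

    Simplifying assumption: greedy longest match, case-insensitive.
    Unknown words fall back to the (hypothetical) default tag "NN".
    """
    max_len = max((len(entry) for entry in ne_dict), default=0)
    tags: List[str] = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in ne_dict:
                matched = n
                break
        if matched:
            tags.extend(["PROTEIN"] * matched)
            i += matched
        else:
            tags.append(word_pos.get(tokens[i].lower(), "NN"))
            i += 1
    return tags


# Toy dictionaries (illustrative only, not taken from the paper).
word_pos = {"gene": "NN", "expression": "NN", "requires": "VBZ"}
ne_dict = {("il-2",)}

tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]

before = tag_tokens(tokens, word_pos, ne_dict)

# Enrich the NE dictionary; no model is retrained, only the lookup changes.
ne_dict.add(("nf-kappa", "b"))
after = tag_tokens(tokens, word_pos, ne_dict)

print(list(zip(tokens, before)))
print(list(zip(tokens, after)))
```

In this sketch the second call tags "NF-kappa B" as PROTEIN purely because of the new dictionary entry. A statistical NER model trained on such tag sequences would see the changed tags as features at prediction time, which is the mechanism the abstract credits for the gain without retraining.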