Journal: Computer Speech and Language

Generalisation in named entity recognition: A quantitative analysis

Abstract

Named Entity Recognition (NER) is a key NLP task, and it is all the more challenging on Web and user-generated content, with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. Our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs play an especially important role: they have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres than for more regular ones. We also find that leading systems rely heavily on surface forms found in training data and have problems generalising beyond these, and we offer explanations for this observation.
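The two kinds of quantity the abstract refers to, the incidence of unseen NEs and precision/recall/F1, can be illustrated with a minimal Python sketch. It is not the paper's actual procedure: the function names, the `(surface_form, entity_type)` representation, the case-insensitive matching of surface forms, and the toy data are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: an entity mention is (surface_form, entity_type).
Mention = Tuple[str, str]


def unseen_ratio(train: Iterable[Mention], test: Iterable[Mention]) -> float:
    """Fraction of test mentions whose surface form never occurs in training.

    Assumption: surface forms are compared case-insensitively.
    """
    train_forms = {surface.lower() for surface, _ in train}
    test_mentions = list(test)
    if not test_mentions:
        return 0.0
    unseen = sum(1 for surface, _ in test_mentions
                 if surface.lower() not in train_forms)
    return unseen / len(test_mentions)


def precision_recall_f1(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for this example only.
train = [("London", "LOC"), ("Obama", "PER")]
test = [("London", "LOC"), ("Bieber", "PER"),
        ("obama", "PER"), ("Snapchat", "ORG")]
print(unseen_ratio(train, test))

# Annotations keyed by (mention position, entity type).
gold = {(0, "LOC"), (1, "PER"), (2, "PER")}
predicted = {(0, "LOC"), (1, "ORG")}
print(precision_recall_f1(gold, predicted))
```

Computing the scores separately for seen and unseen test mentions would be one way to examine how far a system depends on surface forms found in training data.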