Journal: Artificial Intelligence

Learning multilingual named entity recognition from Wikipedia



Abstract

We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7,200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into NE annotations by projecting the target article's classification onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman and Schone, 2008 [61]; Mika et al., 2008 [46]); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.
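The link-projection step the abstract describes, turning inter-article links into NE annotations by copying the target article's type onto the anchor text, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the article-to-type mapping, the tokenisation, and the span representation are all assumptions made for the example.

```python
# Minimal sketch of projecting Wikipedia article classifications onto
# link anchor text to derive silver-standard NER annotations (BIO tags).

# Hypothetical classifications of linked articles into NE types.
ARTICLE_TYPE = {
    "Barack Obama": "PER",
    "Honolulu": "LOC",
    "United States Senate": "ORG",
}

def project_links(tokens, links):
    """Turn link spans into BIO tags over a tokenised sentence.

    tokens: list of word tokens for one sentence.
    links:  list of (start, end_exclusive, target_article_title) spans.
    """
    tags = ["O"] * len(tokens)
    for start, end, target in links:
        ne_type = ARTICLE_TYPE.get(target)
        if ne_type is None:  # target article not classified as an entity
            continue
        tags[start] = f"B-{ne_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ne_type}"
    return tags

tokens = ["Barack", "Obama", "was", "born", "in", "Honolulu", "."]
links = [(0, 2, "Barack Obama"), (5, 6, "Honolulu")]
print(list(zip(tokens, project_links(tokens, links))))
```

Unlinked mentions are left as "O" here, which is one reason raw projected annotations trail gold-standard data; the paper's link-inference and corpus-tweaking steps address exactly that gap.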
