International Conference on Computational Linguistics

N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language

Abstract

Extracting named entities (NEs) from text is an important operation in many natural language processing applications, such as information extraction, question answering, and machine translation. Since the early 1990s, researchers have taken a growing interest in this field, and much work has been done on Named Entity Recognition (NER) in different languages of the world. Unfortunately, Urdu, which is a scarce-resourced language, has not been taken into account. In this paper we present a statistical NER system for Urdu using two basic n-gram models, namely unigram and bigram. We also use gazetteer lists with both models, as well as some smoothing techniques with the bigram NER tagger. The system can recognize 5 classes of NEs; it was trained on data containing 2313 NEs and tested on data containing 104 NEs. The unigram NER tagger using gazetteer lists achieves up to 65.21% precision, 88.63% recall and 75.14% f-measure, while the bigram NER tagger using gazetteer lists and backoff smoothing achieves up to 66.20% precision, 88.18% recall and 75.83% f-measure.
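
The abstract does not spell out how the taggers are built, so the following is only a minimal sketch of one common way to combine a bigram model, a unigram model and gazetteer lists through backoff. The context definition (previous tag plus current word), the fallback order, the tag names and the romanized toy data are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: romanized tokens with NE tags ("O" = not an entity).
TRAIN = [
    [("Ali", "PERSON"), ("Lahore", "LOCATION"), ("mein", "O"), ("rehta", "O"), ("hai", "O")],
    [("Sara", "PERSON"), ("Karachi", "LOCATION"), ("gayi", "O")],
]
# Hypothetical gazetteer list mapping known names to an NE class.
GAZETTEER = {"Islamabad": "LOCATION", "Ahmed": "PERSON"}


def train(sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            unigram[word][tag] += 1
            bigram[(prev, word)][tag] += 1
            prev = tag
    return unigram, bigram


def tag(words, unigram, bigram, gazetteer):
    """Tag each word, backing off: bigram -> unigram -> gazetteer -> "O"."""
    tags, prev = [], "<s>"
    for w in words:
        if (prev, w) in bigram:          # seen bigram context
            t = bigram[(prev, w)].most_common(1)[0][0]
        elif w in unigram:               # back off to the word alone
            t = unigram[w].most_common(1)[0][0]
        elif w in gazetteer:             # unseen word, try the gazetteer
            t = gazetteer[w]
        else:                            # nothing known: not an entity
            t = "O"
        tags.append(t)
        prev = t
    return tags


unigram_counts, bigram_counts = train(TRAIN)
result = tag(["Ahmed", "Lahore", "mein", "Islamabad", "xyz"],
             unigram_counts, bigram_counts, GAZETTEER)
print(result)
```

In this sketch "Ahmed" and "Islamabad" never occur in the training data and are tagged only through the gazetteer, which is the kind of coverage gap gazetteer lists are meant to fill when annotated data is scarce.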