N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language


Abstract

Extracting named entities (NEs) from text is an important operation in many natural language processing applications, such as information extraction, question answering, and machine translation. Since the early 1990s, researchers have taken greater interest in this field, and a great deal of work has been done on Named Entity Recognition (NER) in different languages of the world. Unfortunately, Urdu, which is a scarce-resourced language, has not been taken into account. In this paper we present a statistical NER system for Urdu using two basic n-gram models, namely unigram and bigram. We also use gazetteer lists with both models, as well as some smoothing techniques with the bigram NER tagger. The system can recognize 5 classes of NEs, using training data containing 2313 NEs and test data containing 104 NEs. The unigram NER tagger using gazetteer lists achieves up to 65.21% precision, 88.63% recall and 75.14% f-measure, while the bigram NER tagger using gazetteer lists and backoff smoothing achieves up to 66.20% precision, 88.18% recall and 75.83% f-measure.
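
The abstract names the ingredients (unigram and bigram models, gazetteer lists, backoff) but not how they are combined. The Python sketch below is one minimal way to chain them and is not the authors' implementation. The assumptions are: the bigram context is the previous tag plus the current word; "backoff" means falling back to the unigram model when the bigram context is unseen; the gazetteer is consulted last for unseen words. The class name, the tag names (PERSON, LOCATION, O) and the toy tokens are hypothetical. The f_measure helper is the standard harmonic mean of precision and recall.

```python
from collections import Counter, defaultdict


class BackoffNERTagger:
    """Toy bigram -> unigram -> gazetteer -> 'O' tagger (illustrative only)."""

    def __init__(self, gazetteer=None):
        # Hypothetical gazetteer format: token -> entity class.
        self.gazetteer = dict(gazetteer or {})
        self.unigram = defaultdict(Counter)  # word -> tag counts
        self.bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts

    def train(self, sentences):
        """sentences: iterable of lists of (word, tag) pairs."""
        for sentence in sentences:
            prev_tag = "<s>"  # sentence-start marker
            for word, tag in sentence:
                self.unigram[word][tag] += 1
                self.bigram[(prev_tag, word)][tag] += 1
                prev_tag = tag

    def tag(self, words):
        tags = []
        prev_tag = "<s>"
        for word in words:
            if (prev_tag, word) in self.bigram:
                # Most frequent tag for this (previous tag, word) context.
                tag = self.bigram[(prev_tag, word)].most_common(1)[0][0]
            elif word in self.unigram:
                # Back off to the most frequent tag for the word alone.
                tag = self.unigram[word].most_common(1)[0][0]
            elif word in self.gazetteer:
                # Assumed placement: gazetteer lookup for unseen words.
                tag = self.gazetteer[word]
            else:
                tag = "O"  # not a named entity
            tags.append(tag)
            prev_tag = tag
        return tags


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tagger = BackoffNERTagger(gazetteer={"Karachi": "LOCATION"})
    tagger.train([[("Ali", "PERSON"), ("lives", "O"),
                   ("in", "O"), ("Lahore", "LOCATION")]])
    print(tagger.tag(["Ali", "visited", "Karachi"]))
    print(round(f_measure(65.21, 88.63), 2))
```

With the unigram tagger's reported precision and recall (65.21% and 88.63%), f_measure returns about 75.14, which agrees with the f-measure quoted in the abstract.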


Bibliographic record

Source: International Conference on Computational Linguistics | 2012 | 10 pages
Authors: Faryal Jahangir; Waqas Anwar; Usama Ijaz Bajwa; Xuan Wang
Original format: PDF
Chinese Library Classification: programming languages, algorithmic languages
Keywords: Named Entity Recognition; Unigram model; Bigram model; Gazetteer lists; smoothing techniques


Related literature


1. Challenges of Urdu Named Entity Recognition: A Scarce Resourced Language [J]. Saeeda Naz, Arif Iqbal Umar, Syed Hamad Shirazi. Research Journal of Applied Science, Engineering and Technology. 2014, No. 10

2. Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields [J]. K.P. Pallavi, L. Sobha, M.M. Ramya. Journal of Computer Sciences. 2018, No. 5

3. N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language [C]. Faryal Jahangir, Waqas Anwar, Usama Ijaz Bajwa. 10th Workshop on Asian Language Resources. 2012

4. Improving Search via Named Entity Recognition in Morphologically Rich Languages: A Case Study in Urdu [D]. Riaz, Kashif H. 2018

5. Semi-Supervised Bidirectional Long Short-Term Memory and Conditional Random Fields Model for Named-Entity Recognition Using Embeddings from Language Models Representations [O]. Min Zhang, Guohua Geng, Jing Chen. 2020

6. Named Entity Recognition using Gazetteer Method and N-gram Technique for an Inflectional Language: A Hybrid Approach [O]. Arindam Dey. 2014
