Journal: International Journal of Computer Processing of Languages

Named Entity Recognition in Indian Languages Using Maximum Entropy Approach


Abstract

This paper reports the development of a Named Entity Recognition (NER) system for Indian languages, specifically Bengali, Hindi, Telugu, Oriya and Urdu, using the statistical Maximum Entropy (ME) framework. We use the annotated corpora from the IJCNLP-08 NER Shared Task for South and South East Asian Languages (NERSSEAL), which are tagged with twelve NE tags. A tag conversion routine has been developed to map these corpora to a four-tag scheme: Person name, Location name, Organization name and Miscellaneous name. The system combines contextual information about each word with a variety of orthographic word-level features that help predict the four NE classes. We consider both language-independent and language-specific features. The language-independent features include the contextual words, prefixes and suffixes of all words in the training corpus, several digit features based on the presence and/or number of digits in a token, a first-word-of-sentence feature, and word frequency features. Linguistic features are used for Bengali and Hindi. For Bengali, they include the set of known suffixes that may appear with NEs, clue words that help predict location and organization names, words that help recognize measurement expressions, designation words that help identify person names, and various gazetteer lists (first names, middle names, last names, location names, organization names, function words, month names, weekdays, etc.). For Hindi, the system uses only the lists of first names, middle names, last names, function words, month names and weekdays, along with a list of words that help recognize measurements.
Part-of-speech (POS) information for each word is also used for Bengali and Hindi. No linguistic features are used for Telugu, Oriya and Urdu. The evaluation results show that linguistic features improve the performance of the system. The system has been trained on 122,467 Bengali, 502,974 Hindi, 64,026 Telugu, 93,173 Oriya and 35,447 Urdu tokens. In 10-fold cross-validation, it achieves its highest overall average Recall, Precision and F-Score on Bengali: 88.01%, 82.63% and 85.22%, respectively. For Hindi, Telugu, Oriya and Urdu, 10-fold cross-validation yields overall average F-Scores of 82.66%, 70.11%, 70.13% and 69.3%, respectively.
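
The language-independent features listed in the abstract might be extracted roughly as in the Python sketch below. It is an illustration only, not the authors' implementation: the function name `token_features`, the feature names, the context window of one word, and the maximum affix length of 3 are all assumptions, and the frequency and gazetteer features are omitted because the abstract does not say how they are computed.

```python
def token_features(tokens, i, max_affix=3, window=1):
    """Return a feature dict for tokens[i] (hypothetical helper).

    max_affix and window are illustrative defaults, not values
    taken from the paper.
    """
    word = tokens[i]
    feats = {
        "word": word,
        "first_word": i == 0,  # first word of the sentence
    }

    # Contextual words within the window; positions outside the
    # sentence are filled with a padding symbol.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"word[{off:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    # Prefixes and suffixes of length 1..max_affix, only when the
    # word is strictly longer than the affix.
    for n in range(1, max_affix + 1):
        if len(word) > n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]

    # Digit features: presence and number of digits in the token.
    n_digits = sum(ch.isdigit() for ch in word)
    feats["has_digit"] = n_digits > 0
    feats["all_digits"] = n_digits > 0 and n_digits == len(word)
    feats["n_digits"] = n_digits
    return feats


if __name__ == "__main__":
    sentence = ["Kolkata", "2008", "conference"]
    for idx in range(len(sentence)):
        print(token_features(sentence, idx))
```

Feature dictionaries of this shape could then be passed to any maximum entropy (multinomial logistic regression) trainer; the choice of trainer is left open here because the abstract does not name one.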