Journal: ACM Transactions on Asian Language Information Processing

A Sense Annotated Corpus for All-Words Urdu Word Sense Disambiguation

Abstract

Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word sense ambiguity, and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare, and evaluate WSD techniques. These are available for many languages, but not for Urdu, despite this being a language with more than 300 million speakers and large volumes of text available digitally. To fill this gap, this study proposes a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains 5,042 words of Urdu running text in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of baseline WSD models based on n-grams is applied to the corpus, and the best performance (an accuracy of 57.71%) is achieved with a word 4-gram model. The corpus is freely available to the research community to encourage further WSD research in Urdu.
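
The abstract does not say how the n-gram baselines are built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each ambiguous word is represented by the word n-gram ending at that word, that the sense seen most often with that n-gram in training is predicted, and that an unseen n-gram falls back to the word's most frequent sense. The class name `NgramSenseBaseline`, the helper `context_ngram`, the sense labels, and the toy English sentences are all hypothetical stand-ins for the annotated Urdu data.

```python
from collections import Counter, defaultdict


def context_ngram(tokens, index, n):
    """Return the n tokens ending at position `index`, left-padded with "<s>"."""
    padded = ["<s>"] * (n - 1) + list(tokens)
    end = index + (n - 1)  # position of the target word inside `padded`
    return tuple(padded[end - n + 1 : end + 1])


class NgramSenseBaseline:
    """Hypothetical word n-gram baseline with a most-frequent-sense fallback."""

    def __init__(self, n=4):
        self.n = n
        self.by_context = defaultdict(Counter)  # n-gram -> sense counts
        self.by_word = defaultdict(Counter)     # target word -> sense counts

    def fit(self, examples):
        # Each example is (tokens, index of the ambiguous word, gold sense).
        for tokens, index, sense in examples:
            self.by_context[context_ngram(tokens, index, self.n)][sense] += 1
            self.by_word[tokens[index]][sense] += 1
        return self

    def predict(self, tokens, index):
        ctx = context_ngram(tokens, index, self.n)
        if ctx in self.by_context:
            return self.by_context[ctx].most_common(1)[0][0]
        word = tokens[index]
        if word in self.by_word:
            # Assumed fallback: the word's most frequent training sense.
            return self.by_word[word].most_common(1)[0][0]
        return None  # word never seen with a sense tag


def accuracy(model, examples):
    """Fraction of tagged instances whose predicted sense matches the gold sense."""
    correct = sum(model.predict(t, i) == s for t, i, s in examples)
    return correct / len(examples)


# Toy, made-up training data (not taken from the corpus).
train = [
    (["she", "sat", "by", "the", "bank"], 4, "bank/river"),
    (["he", "went", "to", "the", "bank"], 4, "bank/finance"),
    (["money", "in", "the", "bank"], 3, "bank/finance"),
]
model = NgramSenseBaseline(n=3).fit(train)
seen_context = model.predict(["walk", "by", "the", "bank"], 3)
unseen_context = model.predict(["a", "bank"], 1)
```

Here `seen_context` is resolved from the stored trigram, while `unseen_context` relies on the fallback. Accuracy as reported in the abstract would correspond to calling `accuracy` on held-out tagged instances, under whatever evaluation split the authors used.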
