Journal: ACM Transactions on Asian Language Information Processing

BenLem (A Bengali Lemmatizer) and Its Role in WSD


Abstract

A lemmatization algorithm for Bengali has been developed and evaluated, and its effectiveness for word sense disambiguation (WSD) is investigated. One of the key challenges in computer processing of highly inflected languages is handling the frequent morphological variations of the root words that appear in text. A lemmatizer is therefore essential for developing natural language processing (NLP) tools for such languages. In this experiment, Bengali, the national language of Bangladesh and the second most popular language in the Indian subcontinent, is taken as a reference. To design the Bengali lemmatizer (named BenLem), the possible transformations through which surface words are formed from lemmas are studied, so that appropriate reverse transformations can be applied to a surface word to recover the corresponding lemma. BenLem is found to be capable of handling both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles from the FIRE Bengali News Corpus, comprising 3,342 surface words (excluding proper nouns), and is found to be 81.95% accurate. The role of the lemmatizer in Bengali WSD is then investigated. Ten highly polysemous Bengali words are considered for sense disambiguation, and the WSD dataset is created from the FIRE corpus and a collection of Tagore's short stories. Several WSD systems are considered in this experiment; BenLem improves the performance of all of them, and the improvements are statistically significant.
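The idea of applying reverse transformations to a surface word can be illustrated with a minimal sketch. This is not BenLem's rule set: the suffix rules, the tiny lexicon, and the function name `lemmatize` are all hypothetical, chosen only to show the general shape of suffix-stripping with lexicon validation.

```python
import unicodedata

# Toy reverse-transformation lemmatizer.
# ASSUMPTION: the rules and lexicon below are illustrative only and are
# not taken from BenLem.

def nfc(text: str) -> str:
    """Normalize to NFC so composed and decomposed vowel signs compare equal."""
    return unicodedata.normalize("NFC", text)

# Hypothetical (suffix, replacement) rules, tried in the order listed.
RULES = [(nfc(suffix), nfc(repl)) for suffix, repl in [
    ("গুলো", ""),  # plural / definite marker
    ("দের", ""),   # plural genitive or objective marker
    ("ের", ""),    # genitive marker
    ("রা", ""),    # plural nominative marker
]]

# Hypothetical lexicon of known lemmas, used to validate candidates.
LEXICON = {nfc(word) for word in ["বই", "ছেলে", "মানুষ"]}

def lemmatize(surface: str) -> str:
    """Undo one suffix transformation; fall back to the surface word."""
    word = nfc(surface)
    if word in LEXICON:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept a candidate only if the lexicon confirms it is a lemma.
            if candidate in LEXICON:
                return candidate
    return word
```

In this sketch, `lemmatize("ছেলেদের")` strips the hypothetical suffix rule for `দের` and returns `ছেলে` because that candidate is in the lexicon; a word that matches no rule is returned unchanged. Validating against a lexicon is one simple way to avoid over-stripping, and a lemmatizer that also handles derivational morphology would need a richer rule inventory than shown here.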
