Conference: International Conference on Informatics, Electronics and Vision

A corpus based unsupervised Bangla word stemming using N-gram language model


Abstract

In this paper, we propose a contextual-similarity-based approach for identifying the stems, or root forms, of Bangla words using an N-gram language model. The core purpose of our work is to build a large corpus of Bangla stems with their corresponding inflectional forms. Identifying the stem of a word is generally called stemming, and the tool that identifies stems is called a stemmer. Stemmers are important mainly in information retrieval systems, recommender systems, spell checkers, search engines, and other Natural Language Processing applications. We selected the N-gram model for stem detection based on the assumption that if two words exhibit a certain percentage of similarity in spelling and a certain percentage of contextual similarity across many sentences, then they have a higher probability of originating from the same root. We implemented a 6-gram model for the stem identification procedure and obtained 40.18% accuracy on our corpus.
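The assumption stated in the abstract (similar spelling plus shared N-gram contexts suggests a common root) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define either similarity measure, so the common-prefix ratio used for spelling, the Jaccard overlap used for context, both threshold values, and all function names are assumptions made for illustration. The toy sentences are English placeholders rather than Bangla, and the demo uses n=3 only because the sentences are short; the paper reports using a 6-gram model.

```python
from collections import defaultdict
from itertools import combinations


def ngram_contexts(sentences, n):
    """Map each word to the set of n-gram windows it occurs in, with the word masked."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            for pos, word in enumerate(window):
                # Replace the word itself with "_" so only its neighbours remain.
                masked = tuple(window[:pos] + ["_"] + window[pos + 1:])
                contexts[word].add(masked)
    return contexts


def spelling_similarity(a, b):
    """Assumed measure: common-prefix length divided by the longer word's length."""
    prefix = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        prefix += 1
    return prefix / max(len(a), len(b))


def context_similarity(ctx_a, ctx_b):
    """Assumed measure: Jaccard overlap of the two masked-context sets."""
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


def candidate_pairs(sentences, n=6, spell_min=0.5, ctx_min=0.3):
    """Return word pairs that pass both (hypothetical) similarity thresholds."""
    contexts = ngram_contexts(sentences, n)
    pairs = []
    for a, b in combinations(sorted(contexts), 2):
        if (spelling_similarity(a, b) >= spell_min
                and context_similarity(contexts[a], contexts[b]) >= ctx_min):
            pairs.append((a, b))
    return pairs


# Toy English data standing in for a Bangla corpus (illustrative only).
toy_sentences = [
    "the boy walks to school",
    "the boy walked to school",
    "the girl walks to school",
]

print(candidate_pairs(toy_sentences, n=3))  # [('walked', 'walks')]
```

In this sketch, "walks" and "walked" share the prefix "walk" and occur in three identical masked trigram contexts, so they are grouped as likely inflections of one root. Turning such pairs into actual stems (for example, by taking the shared prefix or the shortest member of a group) is a further design choice that the abstract does not specify.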
