
Bangla word clustering based on N-gram language model



Abstract

In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computerization of Bangla language processing started long ago, but it is still at an early stage and suffers from resource scarcity. We propose an unsupervised machine learning technique to develop Bangla word clusters based on their semantic and contextual similarity using an N-gram language model. According to the N-gram model, a word can be predicted from the sequence of its previous and next words. The N-gram model has been applied successfully to word clustering in English and some other languages. Because word clustering is a new direction in Bangla language processing research, we consider this approach a good starting point, and our results support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
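
The abstract does not state which similarity measure or grouping procedure is used, so the following is only a minimal sketch of the general idea: words that occur between the same previous and next words are treated as contextually similar and placed in one cluster. The Jaccard measure, the 0.5 threshold, the union-find grouping, the function names, and the English toy corpus are all assumptions made for illustration, not the authors' method or data.

```python
from collections import defaultdict


def context_sets(sentences):
    """Map each word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so the first and last words get a context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts


def jaccard(a, b):
    """Overlap of two non-empty context sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b)


def cluster_words(sentences, threshold=0.5):
    """Group words whose context sets overlap by at least `threshold`.

    The threshold value is an illustrative assumption.
    """
    contexts = context_sets(sentences)
    words = sorted(contexts)
    parent = {w: w for w in words}  # union-find forest

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if jaccard(contexts[a], contexts[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = defaultdict(list)
    for w in words:
        clusters[find(w)].append(w)
    return sorted(clusters.values())


# Toy corpus (hypothetical, English for readability).
corpus = [
    "i drink tea daily",
    "i drink coffee daily",
    "we eat rice daily",
    "we eat bread daily",
]
clusters = cluster_words(corpus)
for cluster in clusters:
    print(cluster)
```

On this toy corpus, "tea" and "coffee" share the context (drink, daily) and "rice" and "bread" share (eat, daily), so each pair falls into one cluster while the remaining words stay on their own. A real corpus would need smoothing of rare contexts and a more scalable comparison than the all-pairs loop used here.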
