Bangla word clustering based on N-gram language model

Abstract

In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computerized processing of the Bangla language began long ago, but it is still at an early stage and suffers from resource scarcity. We propose an unsupervised machine learning technique to develop Bangla word clusters based on their semantic and contextual similarity using an N-gram language model. According to the N-gram model, a word can be predicted from the sequence of words that precede and follow it. The N-gram model has been applied successfully to word clustering in English and some other languages. Because word clustering is a new direction in Bangla language processing research, we consider this approach a good starting point, and our results, which are quite decent, support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
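
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea that words sharing the same previous and next words tend to belong together. The Jaccard overlap measure, the `threshold` value, the union-find grouping, the function names, and the toy English corpus are all assumptions made for illustration, not the authors' method or data.

```python
from collections import defaultdict
from itertools import combinations


def context_sets(sentences):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so first and last words also get a context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            contexts[word].add((prev, nxt))
    return contexts


def jaccard(a, b):
    """Overlap of two context sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_words(sentences, threshold=0.5):
    """Group words whose context sets overlap by at least `threshold` (assumed value)."""
    contexts = context_sets(sentences)
    parent = {word: word for word in contexts}

    def find(word):
        # Union-find lookup with path halving.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in combinations(sorted(contexts), 2):
        if jaccard(contexts[a], contexts[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for word in contexts:
        clusters[find(word)].add(word)
    # Report only clusters with more than one member.
    return [sorted(members) for members in clusters.values() if len(members) > 1]


# Hypothetical toy corpus; a real run would use a tokenized Bangla corpus.
toy_corpus = [
    "i eat rice daily",
    "i eat fish daily",
    "we eat rice daily",
    "we eat fish daily",
]
print(sorted(cluster_words(toy_corpus)))
```

On this toy corpus, "rice" and "fish" appear in identical contexts, as do "i" and "we", so each pair ends up in one cluster. A real corpus would need frequency weighting or probabilistic N-gram estimates rather than raw set overlap.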

Bibliographic details

Source
International Conference on Electrical Engineering and Information Communication Technology, 2014, pp. 1-5 (5 pages)
Authors
Ismail Sabir; Rahman M. Shahidur
Original format
PDF
Keywords
information retrieval; machine learning; n-gram model; natural language processing; word cluster

