International Conference on Innovative Computing Technology

Romanized Urdu Corpus Development (RUCD) model: Edit-distance based most frequent unique unigram extraction approach using real-time interactive dataset

Abstract

Urdu ranks very high among the languages used for communication in Southern Asia. Despite its large following, it clearly lacks computational support, which is why it is written in Romanized Urdu script. Although a lot of Romanized Urdu data is available online, the language still lacks a refined corpus. In our research, we propose a refined Romanized Urdu corpus built from the tokens with the highest frequency of occurrence in a data set collected from volunteer participants who used this language interactively as a mode of communication. The raw corpus is passed through a series of steps such as preprocessing, tokenization and annotation before it is passed to the computationally intensive subsequent steps. "Edit Distance" and "K-means Clustering" techniques are used to identify candidate tokens and to decide on their potential selection for inclusion in the refined lexicon. We also identify the most commonly used tokens, candidate tokens and other lingual attributes from the data collected. Based on this analysis, we propose a computational model for refined colloquial Romanized Urdu lexicon development.
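
The edit-distance idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' RUCD pipeline: the sample sentence, the function names `edit_distance` and `group_variants`, the `max_distance` threshold, and the greedy grouping rule are all assumptions made for illustration, and the greedy rule stands in for the K-means clustering step, which the abstract does not describe in detail. The sketch counts unigrams, then attaches each spelling variant to the most frequent token within a small Levenshtein distance.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_variants(tokens, max_distance=1):
    """Greedy grouping (an illustrative assumption, not the paper's method).

    Unigrams are visited from most to least frequent. Each one joins the
    first existing group whose head is within max_distance edits;
    otherwise it becomes the head of a new group.
    """
    counts = Counter(tokens)
    groups = {}
    for token, _ in counts.most_common():
        for head in groups:
            if edit_distance(token, head) <= max_distance:
                groups[head].append(token)
                break
        else:
            groups[token] = [token]
    return counts, groups


# Hypothetical Romanized Urdu sample with spelling variants (kya/kia, hai/hay).
sample = "kya haal hai kia haal hay kya kar rahe ho kya".split()
counts, groups = group_variants(sample)

for head, variants in groups.items():
    print(head, counts[head], variants)
```

With very short tokens, a threshold of one edit can merge unrelated words, so a real lexicon would likely need a length-aware threshold or an additional check before a variant is accepted.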