Conference: International Conference on Innovative Computing Technology

Romanized Urdu Corpus Development (RUCD) Model: Edit-Distance Based Most Frequent Unique Unigram Extraction Approach Using Real-Time Interactive Dataset

Abstract

Urdu ranks very high among the languages used for communication in Southern Asia. Despite its large following, it clearly lacks computational support, which is why it is written in Romanized Urdu script. Although a lot of Romanized Urdu data is available online, the language still lacks a refined corpus. In our research, we propose a refined Romanized Urdu corpus built from the tokens with the highest frequency of occurrence in a data set collected from volunteer participants who used this language interactively as a mode of communication. The raw corpus is passed through a series of steps, such as preprocessing, tokenization, and annotation, before it is passed to the computationally intensive subsequent steps. "Edit Distance" and "K-means Clustering" techniques are used to identify candidate tokens and to decide on their potential selection for inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens, and other lingual attributes from the data collected. Based on this analysis, we propose a computational model for refined colloquial Romanized Urdu lexicon development.
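
The abstract does not give the details of the pipeline, so the following is only a minimal sketch of how edit distance and token frequency might be combined to pick candidate tokens. The function names, the `max_distance` threshold, and the sample tokens are all assumptions made for illustration. The greedy grouping shown here is a simple stand-in and is not the K-means clustering step used by the authors.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def group_variants(tokens, max_distance=1):
    """Greedy grouping (hypothetical helper, not the paper's method).

    Tokens are visited from most to least frequent. A token joins the first
    existing group whose head is within max_distance edits; otherwise it
    starts a new group and becomes that group's head.
    """
    counts = Counter(tokens)
    groups = {}
    for token, _ in counts.most_common():
        for head in groups:
            if edit_distance(token, head) <= max_distance:
                groups[head].append(token)
                break
        else:
            groups[token] = [token]
    return counts, groups


# Hypothetical Romanized Urdu tokens with spelling variants.
sample = ["kya", "kia", "kya", "nahi", "nahin", "nahi",
          "kya", "theek", "thek", "theek"]
counts, groups = group_variants(sample)
for head, variants in groups.items():
    print(head, counts[head], variants)
```

Under these assumptions, each group head is the most frequent spelling among its variants, which corresponds to the idea of keeping the most frequent unique unigram. A threshold of 1 is deliberately strict, because short Romanized words that differ by one letter can be unrelated words, and a real pipeline would need to tune it.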