Venue: Widening Natural Language Processing Workshop

Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features


Abstract

Word segmentation is a fundamental task for most NLP applications. Urdu is written in the Nastalique style, which has no concept of a space between words. Furthermore, the inherent non-joining attributes of certain Urdu characters create spaces within a word when it is written in digital format. Urdu therefore has both space omission and space insertion issues, which make word segmentation challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using a Conditional Random Field sequence model, our approach achieves an F_1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification. These results outperform the state-of-the-art methods.
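As a rough illustration of how a boundary-identification F_1 score of this kind can be computed, the sketch below frames segmentation as per-character labeling and scores predicted word boundaries against gold ones. The B/I label scheme, the function names `boundary_labels` and `boundary_f1`, and the Latin-script toy strings are all assumptions made for illustration; the abstract does not state the paper's actual label set or evaluation script.

```python
def boundary_labels(words):
    """Label each character 'B' if it begins a word, otherwise 'I'.

    Hypothetical B/I scheme for illustration; assumes every word is non-empty.
    """
    labels = []
    for word in words:
        labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels


def boundary_f1(gold, predicted, positive="B"):
    """Compute F1 over positions labeled as word boundaries."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction wrongly splits "segmentation" in two,
# which mimics a space-insertion error.
gold = boundary_labels(["word", "segmentation", "task"])
pred = boundary_labels(["word", "segment", "ation", "task"])
score = boundary_f1(gold, pred)
print(round(score, 3))
```

In this toy case all three gold boundaries are recovered but one spurious boundary is predicted, so precision is 3/4 and recall is 1, giving an F_1 of 6/7.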
