Conference: International and Interdisciplinary Conference on Modeling and Using Context

Context-Driven Corpus-Based Model for Automatic Text Segmentation and Part of Speech Tagging in Setswana Using OpenNLP Tool


Abstract

Setswana is an under-resourced African Bantu language that is morphologically rich and uses a disjunctive writing system. Developing natural language processing (NLP) pipeline tools for such a language can be challenging because the tool's linguistic and semantic robustness must be balanced against computational parsimony. A part-of-speech (POS) tagger is one such tool: it assigns a lexical category such as noun, verb, or pronoun to each word in a text corpus. POS tagging is an important task in NLP applications such as information extraction, machine translation, and word prediction. Developing a POS tagger for a morphologically rich language such as Setswana poses computational-linguistic challenges that can affect the effectiveness of the entire NLP system. Some contextual semantic features of the language demand a fine-grained POS tagset, and that granularity must likewise be weighed against computational parsimony. This paper presents a context-driven, corpus-based model for text segmentation and POS tagging in Setswana. The tagger is developed with the Apache OpenNLP tool and achieves an accuracy of 96.73%.
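To make the reported figure concrete, here is a minimal Python sketch of how a tagging accuracy could be computed. It assumes the 96.73% is a token-level accuracy, which the abstract does not specify. It reads sentences in the `word_tag`, one-sentence-per-line layout used for OpenNLP POS tagger training data. The sample sentence and the tag names (`SC`, `TM`, `OC`, `V`) are hypothetical placeholders, not the paper's corpus or tagset, and the helper functions are illustrative rather than part of OpenNLP.

```python
def parse_tagged_sentence(line, sep="_"):
    """Split one 'word_TAG word_TAG ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST separator, so words containing '_' survive.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def token_accuracy(gold_lines, predicted_lines):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold_line, pred_line in zip(gold_lines, predicted_lines):
        gold = parse_tagged_sentence(gold_line)
        pred = parse_tagged_sentence(pred_line)
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical example: in a disjunctive orthography the verbal morphemes
# are written as separate words, so each one receives its own tag.
gold = ["ke_SC a_TM go_OC rata_V"]
pred = ["ke_SC a_TM go_SC rata_V"]  # one of four tokens is mis-tagged

accuracy = token_accuracy(gold, pred)
print(f"{accuracy:.2%}")  # prints 75.00%
```

Because every orthographic word is scored separately, a finer-grained tagset creates more ways for a single token to be wrong, which is one way to read the trade-off between tagset granularity and parsimony that the abstract describes.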
