Formal Methods of Tokenization for Part-of-Speech Tagging

Abstract

One of the most important preliminary tasks for robust part-of-speech tagging is the correct tokenization, or segmentation, of the text. This task can involve processes far more complex than simply identifying the sentences in a text and their individual components, yet many current applications overlook it. This preprocessing step is nevertheless indispensable in practice, and it is particularly difficult to treat it with scientific precision without repeatedly falling into a case-by-case analysis of every phenomenon detected. In this work, we develop a preprocessing scheme oriented towards the disambiguation and robust tagging of Galician. The proposal is nonetheless a general architecture that can be applied to other languages, such as Spanish, with very slight modifications.
