
A Lazy Man's Way to Part-of-Speech Tagging



Abstract

A statistical approach to word alignment that automatically projects part-of-speech (POS) tags is presented. The approach is called the "lazy man's way" because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised learning method combines the N-gram and Dice coefficient similarity functions to align English texts with Malay texts, thereby projecting the POS tags from English to Malay. It is a quick method that avoids the laborious effort of annotating the Malay dataset. A case study on 25 terrorism news articles written in Malay shows that leveraging pre-existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus, which consists of 5413 word tokens, and reached 86.87% precision, 72.56% recall, and an F1-score of 79.07%. This shows that the "lazy man's way", in which a resource-poor language simply exploits the rich linguistic information available in English, increases bitext projection accuracy significantly.
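The abstract names the ingredients (N-grams, the Dice coefficient, tag projection) but not how they are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes character bigrams as the N-gram unit, a hypothetical similarity threshold of 0.5, and an English sentence that has already been tagged by some existing tagger (the Penn Treebank-style tags and the example words are illustrative). The helper `f1_score` simply checks that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        return {w} if w else set()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def project_tags(tagged_english, malay_tokens, threshold=0.5):
    """Give each Malay token the tag of its most similar English token.

    The threshold is an assumption of this sketch: tokens whose best
    Dice score falls below it are left untagged (None).
    """
    projected = []
    for token in malay_tokens:
        if not tagged_english:
            projected.append((token, None))
            continue
        # Pick the English (word, tag) pair with the highest Dice score.
        best_word, best_tag = max(tagged_english, key=lambda wt: dice(wt[0], token))
        score = dice(best_word, token)
        projected.append((token, best_tag if score >= threshold else None))
    return projected


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative input: an English sentence tagged by an existing tagger (assumed).
english = [("police", "NN"), ("found", "VBD"), ("bomb", "NN")]
malay = ["polis", "jumpa", "bom"]

print(project_tags(english, malay))
# [('polis', 'NN'), ('jumpa', None), ('bom', 'NN')]

print(round(f1_score(86.87, 72.56), 2))
# 79.07
```

In this sketch only orthographically similar pairs such as "polis"/"police" receive a projected tag; a word like "jumpa" shares no bigrams with any English token and stays untagged, which is one plausible reason a similarity-based projection would show higher precision than recall.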
