Conference: Pacific Rim knowledge acquisition workshop

A Lazy Man's Way to Part-of-Speech Tagging

Abstract

A statistical approach to word alignment that automatically projects part-of-speech (POS) tags is presented. The approach is called the "lazy man's way" because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised learning method combines N-gram and Dice coefficient similarity functions to align English texts with Malay texts, thereby projecting the POS tags from English to Malay. It is a quick method that avoids the laborious effort of annotating the Malay dataset. A case study on 25 terrorism news articles written in Malay shows that leveraging pre-existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus, which consists of 5413 word tokens. It achieved 86.87% precision, 72.56% recall and a 79.07% F1-score. This shows that the "lazy man's way", in which a resource-poor language simply exploits the rich linguistic information available in English, increases bitext projection accuracy significantly.
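The abstract does not say how the N-gram and Dice coefficient functions are combined, so the following is only a minimal sketch under stated assumptions: words are compared by the Dice coefficient over their character bigram sets, each Malay token takes the tag of its best-scoring English word, and a hypothetical threshold of 0.5 decides whether a tag is projected at all. The function names, the threshold and the toy word lists are illustrative and not taken from the paper. The last function checks that the reported F1-score is consistent with the reported precision and recall.

```python
from typing import List, Optional, Set, Tuple


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        return {w} if w else set()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|) over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def project_tags(
    english_tagged: List[Tuple[str, str]],
    malay_tokens: List[str],
    threshold: float = 0.5,  # assumption: cut-off is not given in the abstract
) -> List[Tuple[str, Optional[str]]]:
    """Give each Malay token the tag of its most similar English word.

    Tokens whose best similarity is below the threshold stay untagged (None).
    """
    projected = []
    for token in malay_tokens:
        best_tag, best_score = None, 0.0
        for word, tag in english_tagged:
            score = dice(word, token)
            if score > best_score:
                best_tag, best_score = tag, score
        projected.append((token, best_tag if best_score >= threshold else None))
    return projected


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy, hand-made data for illustration only.
    english = [("terrorist", "NN"), ("attack", "NN"), ("police", "NN")]
    malay = ["teroris", "polis", "serangan"]
    print(project_tags(english, malay))
    print(round(100 * f1_score(0.8687, 0.7256), 2))
```

In this toy run, "teroris" and "polis" share enough bigrams with "terrorist" and "police" to receive a tag, while "serangan" has no orthographically similar English counterpart and stays untagged. Under these assumptions, unaligned tokens of this kind lower recall rather than precision.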