Venue: Annual Meeting of the Association for Computational Linguistics

Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation



Abstract

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don't do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.
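The abstract describes co-training: two classifiers, one over monolingual association features and one over bilingual word-order features, each confidently label unlabeled instances for the other. The following is a minimal sketch of that loop under simplifying assumptions — the count-based classifiers, feature names, and thresholds here are illustrative placeholders, not the paper's actual models or features.

```python
# Hypothetical co-training sketch: two feature "views" per example
# (e.g. monolingual vs. bilingual cues for a coordination decision).
# Confident predictions from one view become training data for the other.
from collections import Counter

def train(examples):
    """Count-based classifier: per feature value, tally labels seen."""
    counts = {}
    for feats, label in examples:
        for f in feats:
            counts.setdefault(f, Counter())[label] += 1
    return counts

def predict(model, feats):
    """Return (label, confidence) by summing per-feature label counts."""
    votes = Counter()
    for f in feats:
        votes.update(model.get(f, Counter()))
    if not votes:
        return None, 0.0
    label, n = votes.most_common(1)[0]
    return label, n / sum(votes.values())

def co_train(view1, view2, labels, unlabeled, rounds=3, threshold=0.8):
    """view1/view2: feature lists for the labeled seed examples;
    unlabeled: list of (feats1, feats2) pairs. Each round, each view's
    classifier labels high-confidence unlabeled items for the other."""
    train1 = list(zip(view1, labels))
    train2 = list(zip(view2, labels))
    pool = list(unlabeled)
    for _ in range(rounds):
        m1, m2 = train(train1), train(train2)
        remaining = []
        for f1, f2 in pool:
            y1, c1 = predict(m1, f1)
            y2, c2 = predict(m2, f2)
            if c1 >= threshold:          # view 1 teaches view 2
                train2.append((f2, y1))
            elif c2 >= threshold:        # view 2 teaches view 1
                train1.append((f1, y2))
            else:
                remaining.append((f1, f2))
        pool = remaining
    return train(train1), train(train2)
```

Each round shrinks the unlabeled pool as confident examples migrate into the other view's training set, which is how co-training lets large unannotated corpora compensate for a small annotated seed.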
