Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

Abstract

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don't do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.
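The co-training loop described in the abstract can be pictured with a small sketch: two classifiers, one over monolingual association features and one over bilingual word-order features, are trained on a labeled seed set and then iteratively augmented with the instances they label most confidently. The sketch below is a generic co-training illustration in Python with scikit-learn, not the paper's implementation; the feature matrices, pool sizes, confidence selection, and random toy data are all placeholder assumptions.

    # A generic co-training sketch (illustration only, not the paper's code).
    # Two "views" of each NP-coordination instance are assumed:
    #   X_mono - monolingual association features (e.g. n-gram statistics)
    #   X_bi   - bilingual features (e.g. word order in the translation)
    # Toy random data stands in for real feature extraction.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_mono_l, X_bi_l = rng.normal(size=(100, 20)), rng.normal(size=(100, 20))
    y_l = rng.integers(0, 2, size=100)        # labeled seed set (two possible bracketings)
    X_mono_u, X_bi_u = rng.normal(size=(1000, 20)), rng.normal(size=(1000, 20))

    clf_mono = LogisticRegression(max_iter=1000)
    clf_bi = LogisticRegression(max_iter=1000)

    for _ in range(5):                        # a few co-training rounds
        clf_mono.fit(X_mono_l, y_l)
        clf_bi.fit(X_bi_l, y_l)
        if len(X_mono_u) == 0:
            break
        # Each view picks the unlabeled instances it is most confident about;
        # those instances, with predicted labels, join the shared training set.
        conf_mono = clf_mono.predict_proba(X_mono_u).max(axis=1)
        conf_bi = clf_bi.predict_proba(X_bi_u).max(axis=1)
        picked = np.union1d(np.argsort(-conf_mono)[:50], np.argsort(-conf_bi)[:50])
        votes = clf_mono.predict(X_mono_u[picked]) + clf_bi.predict(X_bi_u[picked])
        y_new = (votes >= 1).astype(int)      # simple tie-break: positive if either view says so
        X_mono_l = np.vstack([X_mono_l, X_mono_u[picked]])
        X_bi_l = np.vstack([X_bi_l, X_bi_u[picked]])
        y_l = np.concatenate([y_l, y_new])
        keep = np.setdiff1d(np.arange(len(X_mono_u)), picked)
        X_mono_u, X_bi_u = X_mono_u[keep], X_bi_u[keep]

In the paper's setting, the two views correspond to features drawn from unannotated monolingual data (lexical associations) and from bitexts (word order in the translation), which is what lets each classifier supply training signal the other lacks.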
