Corpora Generation for Grammatical Error Correction

Abstract

Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similar-sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpus give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide a systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
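To make the round-trip noising and iterative decoding ideas concrete, here is a minimal Python sketch. It illustrates the techniques named in the abstract, not the authors' actual pipeline: translate() and correct() are hypothetical stand-ins for a machine translation system and a trained GEC model, and the choice of bridge languages is left as a parameter since the specific set is an assumption here.

    from typing import Callable, List, Tuple

    def round_trip_noise(
        sentence: str,
        translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text; hypothetical MT interface
        bridge_langs: List[str],
    ) -> List[Tuple[str, str]]:
        """Corrupt a clean sentence by round-trip translation through each
        bridge language, yielding (noisy_source, clean_target) training pairs."""
        pairs = []
        for lang in bridge_langs:
            forward = translate(sentence, "en", lang)  # English -> bridge language
            noisy = translate(forward, lang, "en")     # bridge language -> English, accumulating errors
            if noisy != sentence:                      # keep only pairs that actually introduce noise
                pairs.append((noisy, sentence))
        return pairs

    def iterative_decode(correct: Callable[[str], str], sentence: str, max_iters: int = 5) -> str:
        """Apply a GEC model repeatedly until its output stops changing,
        in the spirit of the iterative decoding strategy the abstract mentions."""
        for _ in range(max_iters):
            out = correct(sentence)
            if out == sentence:  # fixed point: the model proposes no further edits
                break
            sentence = out
        return sentence

Decoding in repeated passes like this lets a model trained on noisy, loosely supervised pairs make conservative incremental edits rather than attempting a full correction in a single shot.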
