
A Simple Recipe for Multilingual Grammatical Error Correction



Abstract

This paper presents a simple recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual language models (up to 11B parameters). Once fine-tuned on language-specific supervised sets, these models surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German, and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing the CLANG-8 dataset. It is produced by using our best model, which we call gT5, to clean the targets of the widely used yet noisy LANG-8 dataset. CLANG-8 greatly simplifies typical GEC training pipelines composed of multiple fine-tuning stages: we demonstrate that performing a single fine-tuning step on CLANG-8 with off-the-shelf language models yields further accuracy improvements over the already top-performing gT5 model for English.
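The first ingredient, language-agnostic synthetic data generation, can be sketched as corrupting clean text with random edits and treating the corrupted sentence as the source and the clean sentence as the target. The sketch below is a minimal illustration, not the authors' exact procedure; the specific operations (drop, swap, insert, case flip) and the edit rate are assumptions:

```python
import random

def corrupt(sentence, rng, edit_rate=0.1):
    """Apply random character-level edits to a clean sentence.

    Character-level operations require no linguistic resources,
    which is what makes the approach language-agnostic.
    """
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < edit_rate:
            op = rng.choice(["drop", "swap", "insert", "case"])
            if op == "drop":
                i += 1          # delete this character
                continue
            if op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose adjacent characters
                out.append(chars[i])
                i += 2
                continue
            if op == "insert":
                out.append(chars[i])
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                i += 1
                continue
            if op == "case":
                out.append(chars[i].swapcase())
                i += 1
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)

def make_synthetic_pair(clean_sentence, rng):
    """Return a (source, target) training pair for a seq2seq GEC model."""
    return corrupt(clean_sentence, rng), clean_sentence

rng = random.Random(0)
src, tgt = make_synthetic_pair(
    "Grammatical error correction works in any language.", rng)
```

Pairs generated this way would then serve as pre-training data for a multilingual seq2seq model before the language-specific supervised fine-tuning stage described above.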

