Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data


Abstract

Considerable effort has been made to address the data sparsity problem in neural grammatical error correction. In this work, we propose a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data. Synthetic data is used to pre-train a Transformer sequence-to-sequence model, which not only improves over a strong baseline trained on authentic error-annotated data, but also enables the development of a practical GEC system in a scenario where little genuine error-annotated data is available. The developed systems placed first in the BEA 2019 shared task, achieving 69.47 and 64.24 F₀.₅ in the restricted and low-resource tracks respectively, both on the W&I+LOCNESS test set. On the popular CoNLL 2014 test set, we report state-of-the-art results of 64.16 M² for the submitted system, and 61.30 M² for the constrained system trained on the NUCLE and Lang-8 data.
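The core of the method is turning abundant monolingual text into (noisy, clean) sentence pairs for pre-training. The following is a minimal Python sketch of confusion-set-based error generation under stated assumptions: the confusion sets, error-type inventory, and probabilities below are illustrative placeholders, not the authors' actual configuration, which extracts confusion sets from a spellchecker such as Aspell and tunes the error distribution.

```python
import random

# Toy confusion sets standing in for ones extracted from a spellchecker;
# these entries are illustrative, not the sets used in the paper.
CONFUSION_SETS = {
    "their": ["there", "they're"],
    "affect": ["effect"],
    "then": ["than"],
    "to": ["too", "two"],
    "its": ["it's"],
}

def corrupt(tokens, p_err=0.15, rng=random):
    """Inject synthetic errors into a clean token sequence.

    Each token is corrupted with probability p_err by one of:
    substitution from its confusion set, deletion, swap with the
    next token, or duplication. The weights are assumed values.
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if rng.random() < p_err:
            op = rng.choices(
                ["substitute", "delete", "swap", "duplicate"],
                weights=[0.5, 0.2, 0.2, 0.1],
            )[0]
            if op == "substitute" and tok.lower() in CONFUSION_SETS:
                out.append(rng.choice(CONFUSION_SETS[tok.lower()]))
            elif op == "delete":
                pass  # drop the token entirely
            elif op == "swap" and i + 1 < len(tokens):
                out.extend([tokens[i + 1], tok])
                i += 1  # consume the swapped neighbour as well
            elif op == "duplicate":
                out.extend([tok, tok])
            else:
                out.append(tok)  # no applicable edit; keep token
        else:
            out.append(tok)
        i += 1
    return out

# Build one (source, target) pre-training pair from clean text:
clean = "I would like to thank their team for the support".split()
noisy = corrupt(clean, rng=random.Random(0))
print(" ".join(noisy), "->", " ".join(clean))
```

Pairs generated this way would serve as pre-training data for the Transformer sequence-to-sequence model, with fine-tuning on authentic error-annotated corpora afterwards.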
