Corpora Generation for Grammatical Error Correction

Abstract

Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similar-sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpus give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide a systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
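To make the round-trip noising and iterative decoding ideas concrete, here is a minimal Python sketch. It illustrates the techniques named in the abstract, not the authors' actual pipeline: translate() and correct() are hypothetical stand-ins for a machine translation system and a trained GEC model, and the choice of bridge languages is left as a parameter since the specific set is an assumption here.

    from typing import Callable, List, Tuple

    def round_trip_noise(
        sentence: str,
        translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text; hypothetical MT interface
        bridge_langs: List[str],
    ) -> List[Tuple[str, str]]:
        """Corrupt a clean sentence by round-trip translation through each
        bridge language, yielding (noisy_source, clean_target) training pairs."""
        pairs = []
        for lang in bridge_langs:
            forward = translate(sentence, "en", lang)  # English -> bridge language
            noisy = translate(forward, lang, "en")     # bridge language -> English, accumulating errors
            if noisy != sentence:                      # keep only pairs that actually introduce noise
                pairs.append((noisy, sentence))
        return pairs

    def iterative_decode(correct: Callable[[str], str], sentence: str, max_iters: int = 5) -> str:
        """Apply a GEC model repeatedly until its output stops changing,
        in the spirit of the iterative decoding strategy the abstract mentions."""
        for _ in range(max_iters):
            out = correct(sentence)
            if out == sentence:  # fixed point: the model proposes no further edits
                break
            sentence = out
        return sentence

Decoding in repeated passes like this lets a model trained on noisy, loosely supervised pairs make conservative incremental edits rather than attempting a full correction in a single shot.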
