Conference: Workshop on Innovative Use of NLP for Building Educational Applications; Annual Meeting of the Association for Computational Linguistics

A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction


Abstract

Spelling correction has attracted a lot of attention in the NLP community. However, models have usually been evaluated on artificially created or proprietary corpora. A publicly available corpus of authentic misspellings, annotated in context, is still lacking. To address this, we present and release an annotated data set of 6,121 spelling errors in context, based on a corpus of essays written by English language learners. We also develop a minimally supervised, context-aware approach to spelling correction. It achieves strong results on our data: 88.12% accuracy. This approach can also train with a minimal amount of annotated data (performance reduced by less than 9c). Furthermore, this approach allows easy portability to new domains. We evaluate our model on data from a medical domain and demonstrate that it rivals the performance of a model trained and tuned on in-domain data.
