Conference: Workshop on noisy user-generated text

Benefits of Data Augmentation for NMT-based Text Normalization of User-Generated Content


Abstract

One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words, which makes UGC harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms and is often used as a pre-processing step for conventional NLP tasks, to counter the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation (NMT) approach to text normalization. Training such an encoder-decoder model requires large parallel corpora of sentence pairs. However, obtaining large data sets of UGC together with their normalized versions is not trivial, especially for languages other than English. In this paper, we explore how to overcome this data bottleneck for Dutch, a low-resource language. We start from a small, publicly available parallel Dutch data set comprising three UGC genres and compare two approaches. The first is to manually normalize and add training data, a costly and time-consuming task. The second is a set of data augmentation techniques that increase data size by converting existing resources into synthesized non-standard forms. Our results reveal that, while the different approaches yield similar results regarding the normalization issues in the test set, they also introduce a large number of over-normalizations.
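
The abstract does not say which augmentation techniques were used, so the following is only a minimal sketch of the general idea of turning canonical text into synthetic non-standard forms to obtain (noisy, clean) sentence pairs. Everything in it is an assumption made for illustration: the function names, the three noise rules (abbreviation lookup, inner-vowel dropping, character lengthening), and the small English abbreviation table, which stands in for the Dutch resources the paper works with.

```python
import random

# Hypothetical abbreviation table (illustrative English entries, not from the paper).
ABBREVIATIONS = {"you": "u", "are": "r", "please": "plz", "tomorrow": "2moro"}
VOWELS = set("aeiou")


def drop_inner_vowels(word):
    """Remove vowels except in the first and last position ('message' -> 'mssge')."""
    if len(word) < 4:
        return word
    inner = "".join(ch for ch in word[1:-1] if ch not in VOWELS)
    return word[0] + inner + word[-1]


def repeat_last_char(word, times=3):
    """Lengthen the final character so it occurs `times` times ('cool' -> 'coolll')."""
    return word + word[-1] * (times - 1) if word else word


def noisify(sentence, rng, p=0.5):
    """Return a synthetic non-standard version of a canonical sentence."""
    noisy = []
    for token in sentence.split():
        lower = token.lower()
        if lower in ABBREVIATIONS:
            # Known words are always replaced by their abbreviation.
            noisy.append(ABBREVIATIONS[lower])
        elif rng.random() < p:
            # Other tokens are perturbed with probability p.
            transform = rng.choice([drop_inner_vowels, repeat_last_char])
            noisy.append(transform(token))
        else:
            noisy.append(token)
    return " ".join(noisy)


def augment(clean_sentences, seed=0, p=0.5):
    """Build (noisy source, clean target) pairs for an encoder-decoder normalizer."""
    rng = random.Random(seed)
    return [(noisify(s, rng, p), s) for s in clean_sentences]
```

Calling `augment(["are you coming tomorrow"], p=0.0)` applies only the abbreviation table and yields the pair `("r u coming 2moro", "are you coming tomorrow")`; with a higher `p`, the remaining tokens are perturbed at random. Pairs produced this way could be added to manually normalized data, with the noisy side as source and the clean side as target.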
