Workshop on Noisy User-generated Text

Benefits of Data Augmentation for NMT-based Text Normalization of User-Generated Content

Abstract

One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words. This characteristic makes UGC considerably harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms and is often used as a pre-processing step for conventional NLP tasks in order to overcome the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation approach to text normalization. Training such an encoder-decoder model requires a large parallel corpus of sentence pairs. However, obtaining large data sets of UGC paired with its normalized version is not trivial, especially for languages other than English. In this paper, we explore how to overcome this data bottleneck for Dutch, a low-resource language. We start from a small publicly available parallel Dutch data set comprising three UGC genres and compare two different approaches. The first is to manually normalize and add training data, a costly and time-consuming task. The second is a set of data augmentation techniques that increase the data size by converting existing resources into synthesized non-standard forms. Our results reveal that, while the different approaches yield similar results on the normalization issues in the test set, they also introduce a large number of over-normalizations.
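To make the data augmentation approach concrete, the sketch below shows one way synthesized non-standard forms could be generated from canonical Dutch sentences and paired with them as extra parallel training data. The specific transformations (a small respelling lexicon, random character deletion, expressive vowel lengthening) and all function names are illustrative assumptions for this sketch, not the augmentation rules used in the paper.

```python
import random

random.seed(0)  # deterministic output for the example

# Illustrative respellings of the kind seen in Dutch UGC; these are
# assumptions for this sketch, not the paper's actual augmentation rules.
RESPELLINGS = {"niet": "ni", "dat": "da", "wat": "wa", "ik": "k"}

def noisify(sentence, p_drop=0.05, p_repeat=0.05):
    """Turn a clean (canonical) sentence into a synthetic non-standard variant."""
    noisy_words = []
    for word in sentence.lower().split():
        word = RESPELLINGS.get(word, word)   # lexical respelling via lookup table
        chars = []
        for ch in word:
            if ch.isalpha() and random.random() < p_drop:
                continue                     # character deletion: "morgen" -> "morgn"
            chars.append(ch)
            if ch in "aeiou" and random.random() < p_repeat:
                chars.append(ch * 2)         # expressive lengthening: "leuk" -> "leuuuk"
        noisy_words.append("".join(chars))
    return " ".join(noisy_words)

def augment(clean_sentences):
    """Pair each synthesized non-standard form with its canonical source sentence."""
    return [(noisify(s), s) for s in clean_sentences]

if __name__ == "__main__":
    print(augment(["Ik weet niet wat je morgen doet ."]))
```

Each synthesized pair adds a (non-standard source, canonical target) example of the kind the encoder-decoder normalization model is trained on.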