
Benefits of Data Augmentation for NMT-based Text Normalization of User-Generated Content



Abstract

One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words, which makes UGC harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms, and it is often used as a pre-processing step for conventional NLP tasks to offset the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation (NMT) approach to text normalization. Training such an encoder-decoder model requires large parallel corpora of sentence pairs. However, obtaining large data sets of UGC together with their normalized versions is not trivial, especially for languages other than English. In this paper, we explore how to overcome this data bottleneck for Dutch, a low-resource language. We start with a small, publicly available parallel Dutch data set comprising three UGC genres and compare two approaches. The first is to manually normalize and add training data, a costly and time-consuming task. The second is a set of data augmentation techniques that increase data size by converting existing resources into synthesized non-standard forms. Our results reveal that, while the different approaches yield similar results on the normalization issues in the test set, they also introduce a large number of over-normalizations.
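
The second approach, converting existing standard text into synthesized non-standard forms, can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify which augmentation techniques are used, so the lexicon `VARIANTS`, the functions `noisify` and `augment`, and the replacement probability `p` are all hypothetical names and assumptions chosen for illustration only.

```python
import random

# Hypothetical lexicon mapping canonical Dutch words to chat-style variants.
# These entries are illustrative assumptions, not taken from the paper.
VARIANTS = {
    "ik": ["k"],
    "het": ["t", "'t"],
    "niet": ["nie"],
    "een": ["n"],
    "weet": ["wee"],
}


def noisify(sentence, rng, p=0.5):
    """Return a synthetic non-standard version of a canonical sentence.

    Each token found in VARIANTS is replaced with probability p by one of
    its variants; all other tokens are copied unchanged.
    """
    out = []
    for token in sentence.split():
        options = VARIANTS.get(token.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)


def augment(clean_sentences, copies=2, p=0.5, seed=0):
    """Build (noisy source, canonical target) pairs from clean sentences."""
    rng = random.Random(seed)  # seeded so the synthetic corpus is reproducible
    pairs = []
    for target in clean_sentences:
        for _ in range(copies):
            pairs.append((noisify(target, rng, p), target))
    return pairs


if __name__ == "__main__":
    for source, target in augment(["ik weet het niet", "dat is een goed idee"]):
        print(f"{source!r} -> {target!r}")
```

Each resulting pair has the same shape as a manually normalized example (non-standard source, canonical target), so under these assumptions it could be appended to a parallel training corpus. Keeping `p` below 1 leaves some tokens untouched, so the synthetic source side mixes standard and non-standard forms.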


Bibliographic record

Source
Workshop on noisy user-generated text | 2019 | xix, 448 p. | 11 pages
Authors
Claudia Matos Veliz; Orphee De Clercq; Veronique Hoste
Original format: PDF
Chinese Library Classification: Programming; software engineering

