Workshop on Noisy User-generated Text

Benefits of Data Augmentation for NMT-based Text Normalization of User-Generated Content

Abstract

One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words. This characteristic makes UGC considerably harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms and is often used as a pre-processing step for conventional NLP tasks in order to overcome the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation approach to text normalization. Training such an encoder-decoder model requires a large parallel corpus of sentence pairs. However, obtaining large data sets of UGC paired with its normalized version is not trivial, especially for languages other than English. In this paper, we explore how to overcome this data bottleneck for Dutch, a low-resource language. We start from a small publicly available parallel Dutch data set comprising three UGC genres and compare two different approaches. The first is to manually normalize and add training data, a costly and time-consuming task. The second is a set of data augmentation techniques that increase the data size by converting existing resources into synthesized non-standard forms. Our results reveal that, while the different approaches yield similar results on the normalization issues in the test set, they also introduce a large number of over-normalizations.
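To make the data augmentation approach concrete, the sketch below shows one way synthesized non-standard forms could be generated from canonical Dutch sentences and paired with them as extra parallel training data. The specific transformations (a small respelling lexicon, random character deletion, expressive vowel lengthening) and all function names are illustrative assumptions for this sketch, not the augmentation rules used in the paper.

```python
import random

random.seed(0)  # deterministic output for the example

# Illustrative respellings of the kind seen in Dutch UGC; these are
# assumptions for this sketch, not the paper's actual augmentation rules.
RESPELLINGS = {"niet": "ni", "dat": "da", "wat": "wa", "ik": "k"}

def noisify(sentence, p_drop=0.05, p_repeat=0.05):
    """Turn a clean (canonical) sentence into a synthetic non-standard variant."""
    noisy_words = []
    for word in sentence.lower().split():
        word = RESPELLINGS.get(word, word)   # lexical respelling via lookup table
        chars = []
        for ch in word:
            if ch.isalpha() and random.random() < p_drop:
                continue                     # character deletion: "morgen" -> "morgn"
            chars.append(ch)
            if ch in "aeiou" and random.random() < p_repeat:
                chars.append(ch * 2)         # expressive lengthening: "leuk" -> "leuuuk"
        noisy_words.append("".join(chars))
    return " ".join(noisy_words)

def augment(clean_sentences):
    """Pair each synthesized non-standard form with its canonical source sentence."""
    return [(noisify(s), s) for s in clean_sentences]

if __name__ == "__main__":
    print(augment(["Ik weet niet wat je morgen doet ."]))
```

Each synthesized pair adds a (non-standard source, canonical target) example of the kind the encoder-decoder normalization model is trained on.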