Encoder-Decoder Methods for Text Normalization


Abstract

Text normalization is the task of mapping non-canonical language, typical of speech transcription and computer-mediated communication, to a standardized written form. It is an upstream task that must be completed before standard natural language processing tools can be applied directly, and it is indispensable for languages such as Swiss German, which has strong regional variation and no written standard. Text normalization has been addressed with a variety of methods, most successfully with character-level statistical machine translation (CSMT). Meanwhile, machine translation has shifted to neural encoder-decoder (ED) models, which have brought remarkable improvements. Text normalization, however, has not yet followed: a number of neural methods have been tried, but CSMT remains the state of the art. In this work, we normalize Swiss German WhatsApp messages using the ED framework. We exploit the flexibility of this framework, which allows us to learn from the same training data in different ways. In particular, we modify the decoding stage of a plain ED model to include target-side language models operating at different levels of granularity: characters and words. Our systematic comparison shows that our approach improves over the CSMT state of the art.
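The abstract does not say how the language model scores enter the decoder. One common way to add target-side language models at decoding time is a weighted sum of log-probabilities inside beam search, and the sketch below illustrates that idea only. Everything in it is an assumption made for illustration: the function names (`beam_search`, `ed_step`, `char_lm`, `word_lm`), the weights `w_char` and `w_word`, and the toy probability tables are hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # end-of-sequence symbol (hypothetical choice)


def beam_search(
    ed_step: Callable[[str], Dict[str, float]],
    char_lm: Callable[[str, str], float],
    word_lm: Callable[[str], float],
    w_char: float = 0.5,
    w_word: float = 0.5,
    beam_size: int = 3,
    max_len: int = 20,
) -> str:
    """Character-level beam search with a weighted sum of log-probabilities.

    ed_step(prefix)    -> {next_char: log-prob} from a stand-in ED model
    char_lm(prefix, c) -> log-prob of c given prefix under a character LM
    word_lm(word)      -> log-prob of a completed word under a word LM
    """
    # Each hypothesis is (score, decoded prefix, finished flag).
    beams: List[Tuple[float, str, bool]] = [(0.0, "", False)]
    for _ in range(max_len):
        candidates: List[Tuple[float, str, bool]] = []
        for score, prefix, done in beams:
            if done:
                candidates.append((score, prefix, True))
                continue
            for ch, ed_lp in ed_step(prefix).items():
                new_score = score + ed_lp
                # The character LM scores every emitted character.
                if ch != EOS:
                    new_score += w_char * char_lm(prefix, ch)
                # The word LM fires only at word boundaries
                # (a space or the end of the sequence).
                if ch in (" ", EOS):
                    last_word = prefix.split(" ")[-1]
                    if last_word:
                        new_score += w_word * word_lm(last_word)
                if ch == EOS:
                    candidates.append((new_score, prefix, True))
                else:
                    candidates.append((new_score, prefix + ch, False))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_size]
        if all(done for _, _, done in beams):
            break
    return beams[0][1]


# Toy stand-ins (assumed values): the ED model slightly prefers copying
# the dialect spelling "vil", while the word LM prefers standard "viel".
ED_TABLE = {
    "": {"v": 0.0},
    "v": {"i": 0.0},
    "vi": {"l": math.log(0.6), "e": math.log(0.4)},
    "vil": {EOS: 0.0},
    "vie": {"l": 0.0},
    "viel": {EOS: 0.0},
}
WORD_LOGPROBS = {"viel": math.log(0.9), "vil": math.log(0.01)}


def ed_step(prefix: str) -> Dict[str, float]:
    return ED_TABLE[prefix]


def char_lm(prefix: str, ch: str) -> float:
    return math.log(0.5)  # uninformative character LM for the toy case


def word_lm(word: str) -> float:
    return WORD_LOGPROBS.get(word, math.log(1e-4))


plain = beam_search(ed_step, char_lm, word_lm, w_char=0.0, w_word=0.0)
fused = beam_search(ed_step, char_lm, word_lm)
print(plain, fused)
```

With both weights set to zero the toy decoder returns "vil", the ED stand-in's preferred output; with the default weights the word-level score tips the result to "viel". The sketch applies no length normalization, and a real system would tune the weights on held-out data.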
