Conference: Annual Meeting of the Association for Computational Linguistics

Neural Machine Translation of Rare Words with Subword Units



Abstract

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English→German and English→Russian by up to 1.1 and 1.3 BLEU, respectively.
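
To make the byte pair encoding idea concrete, here is a minimal sketch of how such a segmentation could be learned and applied. It is an illustration under stated assumptions, not the authors' implementation: the function names (learn_bpe, merge_pair, segment), the "</w>" end-of-word marker, the tie-breaking rule, and the toy word counts are all hypothetical choices made for this example. The idea is to start from single characters and repeatedly merge the most frequent adjacent symbol pair, so that frequent words end up as single units while rare words stay split into smaller pieces.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the "</w>" marker is an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy word counts, invented for illustration only.
    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(corpus, 10)
    print(segment("lowest", merges))  # ['low', 'est</w>']
```

In this toy run the word "lowest" never occurs in the training counts, yet it is segmented into two known units, which is the open-vocabulary behaviour the abstract describes. The number of merges controls the vocabulary size and is a free parameter in this sketch.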
