Annual Meeting of the Association for Computational Linguistics

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates



Abstract

Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements, especially in low-resource and out-of-domain settings.
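A minimal sketch of the sampling idea using the SentencePiece library, which provides unigram language model segmentation with on-the-fly sampling of the kind described in the abstract; the corpus path, model prefix, vocabulary size, and sampling parameters below are illustrative assumptions, not values from the paper.

import sentencepiece as spm

# Train a unigram language model subword vocabulary (hypothetical corpus path and size).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='unigram',
    vocab_size=8000, model_type='unigram'
)

sp = spm.SentencePieceProcessor(model_file='unigram.model')

# Deterministic segmentation: one unique subword sequence per sentence.
print(sp.encode('New York is a city.', out_type=str))

# Subword regularization: sample a different segmentation on each call.
# nbest_size=-1 samples from all candidate segmentations; alpha controls how
# sharply the sampling distribution concentrates on high-probability segmentations.
for _ in range(3):
    print(sp.encode('New York is a city.', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

In an NMT training loop, the sampled encoding would be applied to each sentence anew at every epoch (or every batch), so the model repeatedly sees different subword segmentations of the same training sentence.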
