Workshop on Structured Prediction for NLP

On the Discrepancy between Density Estimation and Sequence Generation

Abstract

Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output $y$ given the input $x$: $p(y \mid x)$. Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output $\hat{y}$ given an input $x$, and each task has its own downstream metric $R$ that scores a model output by comparing against a set of references $y^*$: $R(\hat{y}, y^* \mid x)$. While we hope that a model that excels in density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks. In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared. First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g., autoregressive models, or latent variable models with the same parameterization of the prior). However, we observe no correlation between rankings of models across different families: (1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets.
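
To make the idea of comparing model rankings concrete, the sketch below computes a rank correlation between held-out log-likelihood and BLEU for a set of models. It is only an illustration: the abstract does not say which correlation statistic the authors use, so Spearman's rank correlation is an assumption here, and every number in the example is made up for demonstration and is not taken from the paper. The function names `ranks` and `spearman` are hypothetical helpers.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, largest value first; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based position of the tied run
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores (NOT from the paper): three models from one family.
same_family_loglik = [-1.2, -1.5, -1.9]
same_family_bleu = [27.0, 25.5, 23.1]

# Hypothetical scores (NOT from the paper): four models from mixed families.
mixed_loglik = [-1.0, -1.3, -1.6, -1.9]
mixed_bleu = [22.0, 27.0, 24.0, 25.0]

rho_same = spearman(same_family_loglik, same_family_bleu)
rho_mixed = spearman(mixed_loglik, mixed_bleu)
print(f"within family: {rho_same:.2f}, across families: {rho_mixed:.2f}")
```

With these illustrative inputs the within-family rankings agree perfectly, while the mixed-family rankings do not, which mirrors the qualitative pattern the abstract describes without reproducing any of its actual results.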
