Computational Linguistics

Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments



Abstract

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
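The abstract contrasts a single global correlation coefficient with a quality-level-local view of metric performance. The sketch below illustrates that contrast in a minimal way: it computes the overall Pearson correlation between a metric's scores and human scores, and then recomputes the correlation inside quantile bands of the human scores (low-, mid-, and high-quality translations). The quantile banding here is only an illustration of the general idea; it is not the paper's actual local dependency measure, and the function and variable names are invented for this example.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def banded_correlation(metric_scores, human_scores, n_bands=3):
    """Global Pearson correlation plus per-band correlations, where bands
    are quantile slices of the human scores (low / mid / high quality).

    A rough stand-in for contrasting global correlation with a
    quality-level-local view, as the abstract describes."""
    m = np.asarray(metric_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    edges = np.quantile(h, np.linspace(0.0, 1.0, n_bands + 1))
    per_band = {}
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        # The last band is closed on the right so the maximum score is included.
        mask = (h >= lo) & ((h <= hi) if i == n_bands - 1 else (h < hi))
        if mask.sum() > 1:  # need at least two points for a correlation
            per_band[f"band_{i}"] = pearson(m[mask], h[mask])
    return pearson(m, h), per_band
```

On synthetic data with uniform additive noise, the per-band correlations come out well below the global one purely because of range restriction within each band — one reason a single global coefficient can overstate how well a metric discriminates among translations of similar quality.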
