Computational Linguistics

Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments



Abstract

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
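The abstract contrasts a single global correlation coefficient with a quality-level-local view of metric performance. The sketch below illustrates that contrast in a minimal way: it computes the overall Pearson correlation between a metric's scores and human scores, and then recomputes the correlation inside quantile bands of the human scores (low-, mid-, and high-quality translations). The quantile banding here is only an illustration of the general idea; it is not the paper's actual local dependency measure, and the function and variable names are invented for this example.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def banded_correlation(metric_scores, human_scores, n_bands=3):
    """Global Pearson correlation plus per-band correlations, where bands
    are quantile slices of the human scores (low / mid / high quality).

    A rough stand-in for contrasting global correlation with a
    quality-level-local view, as the abstract describes."""
    m = np.asarray(metric_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    edges = np.quantile(h, np.linspace(0.0, 1.0, n_bands + 1))
    per_band = {}
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        # The last band is closed on the right so the maximum score is included.
        mask = (h >= lo) & ((h <= hi) if i == n_bands - 1 else (h < hi))
        if mask.sum() > 1:  # need at least two points for a correlation
            per_band[f"band_{i}"] = pearson(m[mask], h[mask])
    return pearson(m, h), per_band
```

On synthetic data with uniform additive noise, the per-band correlations come out well below the global one purely because of range restriction within each band — one reason a single global coefficient can overstate how well a metric discriminates among translations of similar quality.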
