Annual Meeting of the Association for Computational Linguistics (ACL 2011)

MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames


Abstract

We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, non-automatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semi-automated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER.
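The abstract says only that the metric scores a translation "by matching semantic role fillers" and does not give the formula. As a rough illustration of that idea, here is a minimal Python sketch under stated assumptions: the `Frame` type, the role labels, the exact-string match criterion, and the F-score aggregation are all hypothetical choices made for this example, not the paper's actual definition of MEANT or HMEANT. The last line simply checks the arithmetic behind the "about 80%" claim using the two correlation figures quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Frame:
    """One semantic frame: a predicate plus its role fillers (hypothetical type)."""
    predicate: str
    roles: Dict[str, str] = field(default_factory=dict)  # role label -> filler text


def role_filler_f_score(mt_frames: List[Frame], ref_frames: List[Frame]) -> float:
    """Toy F-score over role fillers.

    Assumption: a filler counts as matched only when the predicate, the role
    label, and the filler string are all identical in MT output and reference.
    """
    ref_by_predicate = {frame.predicate: frame for frame in ref_frames}
    matched = 0
    for frame in mt_frames:
        ref = ref_by_predicate.get(frame.predicate)
        if ref is None:
            continue
        matched += sum(
            1 for label, filler in frame.roles.items()
            if ref.roles.get(label) == filler
        )
    mt_total = sum(len(frame.roles) for frame in mt_frames)
    ref_total = sum(len(frame.roles) for frame in ref_frames)
    if matched == 0 or mt_total == 0 or ref_total == 0:
        return 0.0
    precision = matched / mt_total
    recall = matched / ref_total
    return 2 * precision * recall / (precision + recall)


# Invented example data, for illustration only.
reference = [Frame("introduce", {"agent": "we", "patient": "a new metric"})]
mt_output = [Frame("introduce", {"agent": "we", "patient": "new measure"})]
print(role_filler_f_score(mt_output, reference))

# Ratio of the two correlations quoted in the abstract (MEANT vs. HTER).
print(round(0.34 / 0.43, 2))
```

In this toy case one of the two role fillers matches, so precision and recall are both one half. Partial credit for near-matching fillers, or weighting of role types, would require details that the abstract does not state.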
