43rd Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference

On Some Pitfalls in Automatic Evaluation and Significance Testing for MT


Abstract

We investigate some pitfalls regarding the discriminatory power of MT evaluation metrics and the accuracy of statistical significance tests. In a discriminative reranking experiment for phrase-based SMT we show that the NIST metric is more sensitive than BLEU or F-score despite their incorporation of aspects of fluency or meaning adequacy into MT evaluation. In an experimental comparison of two statistical significance tests we show that p-values are estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter. We point out a pitfall of randomly assessing significance in multiple pairwise comparisons, and conclude with a recommendation to combine NIST with approximate randomization, at more stringent rejection levels than is currently standard.
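The approximate randomization test the abstract recommends can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the use of per-sentence scores, and the choice of the absolute difference in mean metric score as the test statistic are illustrative assumptions.

```python
import random

def approx_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Paired approximate randomization test for comparing two MT systems.

    scores_a, scores_b: per-sentence metric scores for the two systems
    on the same test set. Returns an estimated p-value for the observed
    difference in mean scores.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)
    hits = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two systems are interchangeable,
            # so each paired pair of outputs may be swapped at random.
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(sum(shuf_a) / n - sum(shuf_b) / n)
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the p-value estimate conservative.
    return (hits + 1) / (trials + 1)
```

Unlike the bootstrap, which resamples sentences with replacement, this test only shuffles the assignment of outputs to systems within each pair, which is one reason its p-value estimates tend to be more conservative.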
