We investigate some pitfalls regarding the discriminatory power of MT evaluation metrics and the accuracy of statistical significance tests. In a discriminative reranking experiment for phrase-based SMT we show that the NIST metric is more sensitive than BLEU or F-score despite their incorporation of aspects of fluency or meaning adequacy into MT evaluation. In an experimental comparison of two statistical significance tests we show that p-values are estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter. We point out a pitfall of randomly assessing significance in multiple pairwise comparisons, and conclude with a recommendation to combine NIST with approximate randomization, at more stringent rejection levels than is currently standard.
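
To make the approximate randomization test mentioned above concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes, as a simplification, that each system is represented by a list of per-sentence scores and that the test statistic is the absolute difference of their means; corpus-level metrics such as NIST or BLEU would instead have to be recomputed on each shuffled pair of outputs. The function name `approximate_randomization` and the parameters `trials` and `seed` are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (illustrative sketch).

    Assumption: the statistic is the absolute difference of mean
    per-sentence scores, which stands in for a real MT metric.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n

    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so each sentence pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

The returned p-value would then be compared against the chosen rejection level; when several system pairs are compared, that level should be made stricter than for a single comparison, in line with the recommendation above.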