OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

Abstract

Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics are observed to correlate poorly with human evaluation. The lack of standardized benchmark datasets makes it difficult to fully evaluate the capabilities of a metric and fairly compare different metrics. Therefore, we propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples. We evaluate existing metrics on OpenMEVA and observe that they correlate poorly with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge (e.g., causal order between events), generalization ability, and robustness. Our study presents insights for developing NLG models and metrics in future research.
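
As a minimal sketch of capability (a), correlation with human judgments can be measured by pairing a metric's scores with averaged human ratings for the same generated stories and computing Pearson and Spearman correlations. The scores below are hypothetical placeholders, not data from the benchmark.

    # Sketch: correlating an automatic metric with human ratings.
    # metric_scores and human_ratings are hypothetical per-story values.
    from scipy.stats import pearsonr, spearmanr

    metric_scores = [0.62, 0.41, 0.88, 0.35, 0.74]
    human_ratings = [3.8, 2.5, 4.6, 2.1, 4.0]

    pearson_r, _ = pearsonr(metric_scores, human_ratings)    # linear correlation
    spearman_r, _ = spearmanr(metric_scores, human_ratings)  # rank correlation

    print(f"Pearson r:  {pearson_r:.3f}")
    print(f"Spearman r: {spearman_r:.3f}")

A metric that ranks stories the way human annotators do yields a Spearman correlation close to 1; the benchmark's finding is that existing metrics fall well short of this.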
