OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

Abstract

Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics are observed to correlate poorly with human evaluation. The lack of standardized benchmark datasets makes it difficult to fully evaluate the capabilities of a metric and fairly compare different metrics. Therefore, we propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples. We evaluate existing metrics on OpenMEVA and observe that they correlate poorly with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge (e.g., causal order between events), generalization ability, and robustness. Our study presents insights for developing NLG models and metrics in future research.
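
As a minimal sketch of capability (a), correlation with human judgments can be measured by pairing a metric's scores with averaged human ratings for the same generated stories and computing Pearson and Spearman correlations. The scores below are hypothetical placeholders, not data from the benchmark.

    # Sketch: correlating an automatic metric with human ratings.
    # metric_scores and human_ratings are hypothetical per-story values.
    from scipy.stats import pearsonr, spearmanr

    metric_scores = [0.62, 0.41, 0.88, 0.35, 0.74]
    human_ratings = [3.8, 2.5, 4.6, 2.1, 4.0]

    pearson_r, _ = pearsonr(metric_scores, human_ratings)    # linear correlation
    spearman_r, _ = spearmanr(metric_scores, human_ratings)  # rank correlation

    print(f"Pearson r:  {pearson_r:.3f}")
    print(f"Spearman r: {spearman_r:.3f}")

A metric that ranks stories the way human annotators do yields a Spearman correlation close to 1; the benchmark's finding is that existing metrics fall well short of this.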
