Annual meeting of the Association for Computational Linguistics

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets were created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (the average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic because the metrics disagree, yet we cannot decide which one to trust. This is a call for collecting human judgments for high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also be greatly beneficial for further improving summarization systems and metrics alike.
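The paper's core observation, that metrics which agree in the average-scoring range can diverge in the high-scoring range, can be illustrated with a small simulation. The sketch below is not from the paper: it uses synthetic data, hypothetical metric names (metric_a, metric_b), and Kendall's tau to show how a metric that saturates for strong summaries loses correlation with human judgments precisely in the range where current systems operate.

```python
# Illustrative sketch only: simulated human judgments and two hypothetical metrics,
# compared by rank correlation in the average-scoring vs. high-scoring range.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

n = 500
human = rng.uniform(0.0, 1.0, n)                      # simulated human scores
metric_a = human + rng.normal(0.0, 0.10, n)           # metric that tracks humans everywhere
metric_b = np.minimum(human, 0.7) + rng.normal(0.0, 0.10, n)  # metric that saturates for strong summaries

def correlation_in_range(metric, human, lo, hi):
    """Kendall's tau between metric and human scores, restricted to summaries
    whose human score falls in [lo, hi)."""
    mask = (human >= lo) & (human < hi)
    tau, _ = kendalltau(metric[mask], human[mask])
    return tau

for name, metric in [("metric_a", metric_a), ("metric_b", metric_b)]:
    avg_tau = correlation_in_range(metric, human, 0.3, 0.7)   # average-scoring range
    high_tau = correlation_in_range(metric, human, 0.7, 1.0)  # high-scoring range
    print(f"{name}: tau(avg range)={avg_tau:.2f}, tau(high range)={high_tau:.2f}")
```

Under these assumptions, both metrics correlate well with human judgments in the average range, while metric_b's correlation collapses in the high range, which is the kind of disagreement the paper reports for real metrics and modern systems.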
