Annual meeting of the Association for Computational Linguistics

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets were created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (the average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic because the metrics disagree, yet we cannot decide which one to trust. This is a call for collecting human judgments for high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also be greatly beneficial for further improving summarization systems and metrics alike.
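The paper's core observation, that metrics which agree in the average-scoring range can diverge in the high-scoring range, can be illustrated with a small simulation. The sketch below is not from the paper: it uses synthetic data, hypothetical metric names (metric_a, metric_b), and Kendall's tau to show how a metric that saturates for strong summaries loses correlation with human judgments precisely in the range where current systems operate.

```python
# Illustrative sketch only: simulated human judgments and two hypothetical metrics,
# compared by rank correlation in the average-scoring vs. high-scoring range.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

n = 500
human = rng.uniform(0.0, 1.0, n)                      # simulated human scores
metric_a = human + rng.normal(0.0, 0.10, n)           # metric that tracks humans everywhere
metric_b = np.minimum(human, 0.7) + rng.normal(0.0, 0.10, n)  # metric that saturates for strong summaries

def correlation_in_range(metric, human, lo, hi):
    """Kendall's tau between metric and human scores, restricted to summaries
    whose human score falls in [lo, hi)."""
    mask = (human >= lo) & (human < hi)
    tau, _ = kendalltau(metric[mask], human[mask])
    return tau

for name, metric in [("metric_a", metric_a), ("metric_b", metric_b)]:
    avg_tau = correlation_in_range(metric, human, 0.3, 0.7)   # average-scoring range
    high_tau = correlation_in_range(metric, human, 0.7, 1.0)  # high-scoring range
    print(f"{name}: tau(avg range)={avg_tau:.2f}, tau(high range)={high_tau:.2f}")
```

Under these assumptions, both metrics correlate well with human judgments in the average range, while metric_b's correlation collapses in the high range, which is the kind of disagreement the paper reports for real metrics and modern systems.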
