Workshop on Human Evaluation of NLP Systems

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems using Checklist


Abstract

Despite achieving state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to an insufficient understanding of models' capabilities and limitations and a heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle 'Checklist', which decouples testing from implementation, has revealed significant failures in state-of-the-art models across multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments to evaluate an offensive content detection system and apply a data augmentation technique, informed by insights from Checklist, to improve the model. We lay out the challenges and open questions based on our observations from using Checklist for human-in-loop evaluation and improvement of NLP systems.
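
To make the abstract's reference to behavioral testing concrete, the following is a minimal sketch of a Checklist-style Minimum Functionality Test (MFT) for an offensive content detector, written in Python against the open-source checklist library. The predict_proba function is a hypothetical stub standing in for the system under evaluation, not the paper's actual model, and the template and label choices are illustrative assumptions.

    import numpy as np
    from checklist.editor import Editor
    from checklist.test_types import MFT
    from checklist.pred_wrapper import PredictorWrapper

    def predict_proba(texts):
        # Hypothetical stand-in for the offensive content detector:
        # returns [P(non-offensive), P(offensive)] for each input.
        return np.array([[0.9, 0.1]] * len(texts))

    editor = Editor()
    # Generate inputs from a template; a robust detector should label
    # all of them non-offensive (label 0), regardless of its internals.
    ret = editor.template('You are a {adj} person.',
                          adj=['kind', 'thoughtful', 'generous'],
                          labels=0)
    test = MFT(ret.data, labels=ret.labels,
               name='Neutral adjectives are not offensive',
               capability='Vocabulary')

    # Run the test against the model and report the failure rate.
    test.run(PredictorWrapper.wrap_softmax(predict_proba))
    test.summary()

Test cases that fail in this way can then be folded back into the training data, which is the high-level idea behind the data augmentation step the abstract mentions.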