Workshop on Human Evaluation of NLP Systems

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems using Checklist


Abstract

Despite achieving state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to an insufficient understanding of models' capabilities and limitations and a heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle 'Checklist', which decouples testing from implementation, has revealed significant failures in state-of-the-art models across multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments to evaluate an offensive content detection system and apply a data augmentation technique, informed by insights from Checklist, to improve the model. We lay out the challenges and open questions based on our observations from using Checklist for human-in-loop evaluation and improvement of NLP systems.
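
To make the abstract's reference to behavioral testing concrete, the following is a minimal sketch of a Checklist-style Minimum Functionality Test (MFT) for an offensive content detector, written in Python against the open-source checklist library. The predict_proba function is a hypothetical stub standing in for the system under evaluation, not the paper's actual model, and the template and label choices are illustrative assumptions.

    import numpy as np
    from checklist.editor import Editor
    from checklist.test_types import MFT
    from checklist.pred_wrapper import PredictorWrapper

    def predict_proba(texts):
        # Hypothetical stand-in for the offensive content detector:
        # returns [P(non-offensive), P(offensive)] for each input.
        return np.array([[0.9, 0.1]] * len(texts))

    editor = Editor()
    # Generate inputs from a template; a robust detector should label
    # all of them non-offensive (label 0), regardless of its internals.
    ret = editor.template('You are a {adj} person.',
                          adj=['kind', 'thoughtful', 'generous'],
                          labels=0)
    test = MFT(ret.data, labels=ret.labels,
               name='Neutral adjectives are not offensive',
               capability='Vocabulary')

    # Run the test against the model and report the failure rate.
    test.run(PredictorWrapper.wrap_softmax(predict_proba))
    test.summary()

Test cases that fail in this way can then be folded back into the training data, which is the high-level idea behind the data augmentation step the abstract mentions.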