International Joint Conference on Natural Language Processing; Conference on Empirical Methods in Natural Language Processing

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets


Abstract

Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. A common crowdsourcing practice is to recruit a small number of high-quality workers, and have them massively generate examples. Having only a few workers generate the majority of examples raises concerns about data diversity, especially when workers freely generate sentences. In this paper, we perform a series of experiments showing these concerns are evident in three recent NLP datasets. We show that model performance improves when training with annotator identifiers as features, and that models are able to recognize the most productive annotators. Moreover, we show that often models do not generalize well to examples from annotators that did not contribute to the training set. Our findings suggest that annotator bias should be monitored during dataset creation, and that test set annotators should be disjoint from training set annotators.
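The abstract points to two methodological ideas that a short sketch can make concrete: using annotator identifiers as features, and keeping test-set annotators disjoint from training-set annotators. Below is a minimal sketch of an annotator-disjoint split over hypothetical toy data; it uses scikit-learn's GroupShuffleSplit, which is an illustrative choice rather than the procedure the authors necessarily used.

from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: each example records which crowd worker wrote it.
examples   = ["example 1", "example 2", "example 3", "example 4", "example 5", "example 6"]
labels     = [0, 1, 0, 1, 0, 1]
annotators = ["worker_a", "worker_a", "worker_b", "worker_b", "worker_c", "worker_c"]

# Hold out one whole annotator: grouping by annotator ID guarantees that
# no annotator contributes examples to both the training and the test split.
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(examples, labels, groups=annotators))

train_workers = {annotators[i] for i in train_idx}
test_workers  = {annotators[i] for i in test_idx}
assert train_workers.isdisjoint(test_workers)  # disjoint by construction

# To probe the other point (annotator identifier as a feature), one could
# simply prepend the worker ID to each input string before training, e.g.
# f"{annotators[i]} {examples[i]}" -- a hypothetical encoding, not the paper's.

Grouping the split by annotator rather than by example is what prevents a model from exploiting annotator-specific writing patterns at test time, which is the failure mode the paper investigates.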
