Information Retrieval (journal)

An analysis of human factors and label accuracy in crowdsourcing relevance judgments

Abstract

Crowdsourcing relevance judgments for the evaluation of search engines is used increasingly to overcome the issue of scalability that hinders traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals (the crowd), motivated by different incentives, who complete the tasks with varying levels of attention and success. This increases the need for a careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon’s Mechanical Turk, conducted to explore the ‘human’ characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured based on agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines to crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from the peculiarity of the crowdsourcing environment, where the sample of individuals engaged in specific work conditions is inherently influenced by the conditions themselves.
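The accuracy measure referenced in the abstract, agreement between crowd-assigned relevance labels and a gold set, can be illustrated with a minimal sketch. The data layout, field names, and per-worker aggregation below are illustrative assumptions for clarity only, not the authors' actual analysis pipeline.

```python
# Minimal sketch (assumed data layout, not the paper's pipeline):
# compute each worker's label accuracy as the fraction of their labels
# that agree with gold (expert) relevance judgments.

from collections import defaultdict

# Hypothetical crowd judgments: (worker_id, doc_id, relevance_label)
crowd_labels = [
    ("w1", "d1", 1), ("w1", "d2", 0), ("w1", "d3", 1),
    ("w2", "d1", 1), ("w2", "d2", 1), ("w2", "d3", 0),
]

# Hypothetical gold labels: doc_id -> relevance_label
gold = {"d1": 1, "d2": 0, "d3": 1}

def accuracy_per_worker(labels, gold):
    """Fraction of each worker's labels that agree with the gold set."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for worker, doc, label in labels:
        if doc in gold:                 # only gold-judged documents count
            totals[worker] += 1
            hits[worker] += int(label == gold[doc])
    return {w: hits[w] / totals[w] for w in totals}

print(accuracy_per_worker(crowd_labels, gold))
# e.g. {'w1': 1.0, 'w2': 0.333...}
```

The same per-worker (or per-condition) accuracy can then be correlated with the self-reported human-factor measures described above, such as motivation, topic familiarity, perceived difficulty, and pay satisfaction.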