A game-based framework for crowdsourced data labeling


Abstract

Data labeling, which assigns data items to one of multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that substantially reduce labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) to play the role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) to play the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject rules with low accuracy. This paper studies three challenges in CrowdGame. The first is to balance the trade-off between coverage and accuracy; we define the loss of a rule by considering both factors. The second is rule accuracy estimation; we utilize Bayesian estimation to combine the results of both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks that fulfill the game-based framework while minimizing the loss; we introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.
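
The abstract does not give the paper's formulas, so the following is only a minimal sketch of how the two kinds of crowd answers could feed one accuracy estimate and one coverage/accuracy loss. The Beta-Bernoulli update, the `prior_strength` smoothing, the weight `gamma`, and the loss form are all illustrative assumptions, not CrowdGame's actual definitions; the minimax task selection is not covered here.

```python
from dataclasses import dataclass


@dataclass
class RuleEvidence:
    """Crowd answers collected for one candidate labeling rule (hypothetical structure)."""
    covered: int          # number of tuples the rule would label
    validation_yes: int   # rule validation answers: "rule is valid"
    validation_no: int    # rule validation answers: "rule is invalid"
    tuples_correct: int   # tuple checking answers confirming the rule's label
    tuples_wrong: int     # tuple checking answers refuting the rule's label


def posterior_accuracy(ev: RuleEvidence, prior_strength: float = 1.0) -> float:
    """Posterior mean of rule accuracy under an assumed Beta-Bernoulli model.

    Both task types are treated as equally weighted Bernoulli observations,
    which is a simplification made for this sketch.
    """
    alpha = prior_strength + ev.validation_yes + ev.tuples_correct
    beta = prior_strength + ev.validation_no + ev.tuples_wrong
    return alpha / (alpha + beta)


def rule_loss(ev: RuleEvidence, total_tuples: int, gamma: float = 0.5) -> float:
    """Assumed loss: expected mislabeled fraction traded off against uncovered fraction."""
    coverage = ev.covered / total_tuples
    expected_error = (1.0 - posterior_accuracy(ev)) * coverage
    return gamma * expected_error + (1.0 - gamma) * (1.0 - coverage)


# A well-supported rule versus a broader rule that the refuter's checks contradict.
supported = RuleEvidence(covered=80, validation_yes=3, validation_no=0,
                         tuples_correct=2, tuples_wrong=0)
refuted = RuleEvidence(covered=90, validation_yes=1, validation_no=2,
                       tuples_correct=0, tuples_wrong=4)

best = min([supported, refuted], key=lambda ev: rule_loss(ev, total_tuples=100))
```

With these made-up counts, the refuted rule has larger coverage but a much lower estimated accuracy, so the supported rule ends up with the smaller loss; changing `gamma` shifts how strongly coverage is rewarded.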

Bibliographic details

  • Source
    The VLDB Journal | 2020, Issue 6 | pp. 1311-1336 | 26 pages
  • Author affiliations

    Renmin Univ China Beijing 100872 Peoples R China;

    Renmin Univ China Beijing 100872 Peoples R China;

    Renmin Univ China Beijing 100872 Peoples R China;

    Tsinghua Univ Beijing 100084 Peoples R China;

    Renmin Univ China Beijing 100872 Peoples R China;

    Renmin Univ China Beijing 100872 Peoples R China;

  • Indexed in: Science Citation Index (SCI), USA
  • Original format: PDF
  • Language: English
  • Keywords

    Crowdsourcing; Data labeling; Labeling rules;
