Published in: ACM KDD International Conference on Knowledge Discovery and Data Mining (KDD 2008)

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

Abstract

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results: (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
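Result (i) can be illustrated with a small calculation. The sketch below makes assumptions the abstract does not state: labels are binary, the repeated labels for an item are combined by majority vote, and each labeler is independently correct with the same probability `p`. The function name `majority_vote_quality` is hypothetical and chosen only for this illustration.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority vote over n labels is correct.

    Assumptions (not taken from the abstract): binary labels, n odd so
    ties cannot occur, and each label independently correct with
    probability p.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    # The vote is correct when more than half of the n labels are correct.
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # Compare labelers of different individual quality p
    # across an increasing number of labels per item.
    for p in (0.4, 0.5, 0.7, 0.9):
        qualities = [majority_vote_quality(p, n) for n in (1, 3, 5, 11)]
        print(p, [round(q, 3) for q in qualities])
```

Under these assumptions, the quality of the voted label rises with the number of labels when `p` is above 0.5, stays at 0.5 when `p` is exactly 0.5, and falls when `p` is below 0.5, which is one simple way to read "but not always".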