Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

Abstract

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (ⅰ) Repeated labeling can improve label quality and model quality, but not always. (ⅱ) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (ⅲ) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (ⅳ) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
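
Result (ⅰ), that repeated labeling helps "but not always", can be illustrated with a small calculation. The sketch below is not taken from the abstract: it assumes a binary task, labelers who err independently with the same per-label accuracy `p`, and integration of the repeated labels by simple majority vote. The function name `majority_vote_quality` is hypothetical and chosen for illustration only.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority of n labels is correct.

    Assumptions (not stated in the abstract): binary labels, each label is
    correct independently with probability p, and n is odd so no ties occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of labels to avoid ties")
    # Sum the binomial probabilities of getting more than n/2 correct labels.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6, 0.9):
        row = ", ".join(
            f"n={n}: {majority_vote_quality(p, n):.3f}" for n in (1, 3, 5, 11)
        )
        print(f"p={p}: {row}")
```

Under these assumptions, three labels at `p = 0.6` raise the quality of the integrated label to 0.648, while three labels at `p = 0.4` lower it to 0.352, and at `p = 0.5` extra labels change nothing. This is one simple way to see why the benefit depends on labeler quality.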

Bibliographic details

Source: ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008 | 2008 | pp. 596-604 | 9 pages
Authors: Victor S. Sheng; Foster Provost; Panagiotis G. Ipeirotis
Original format: PDF
Chinese Library Classification: Information and knowledge dissemination
Keywords: data selection; data preprocessing
