The Intelligent Management of Crowd-Powered Machine Learning

Abstract

Artificial intelligence and machine learning power many technologies today, from spam filters to self-driving cars to medical decision assistants. While this revolution has hugely benefited from algorithmic developments, it also could not have occurred without data, which nowadays is frequently procured at massive scale from crowds. Because data is so crucial, a key next step towards truly autonomous agents is the design of better methods for intelligently managing now-ubiquitous crowd-powered data-gathering processes. This dissertation takes this key next step by developing algorithms for the online and dynamic control of these processes. We consider how to gather data for its two primary purposes: training and evaluation.

In the first part of the dissertation, we develop algorithms for obtaining data for testing. The most important requirement of testing data is that it must be extremely clean. Thus, to deal with noisy human annotations, machine learning practitioners typically rely on careful workflow design and advanced statistical techniques for label aggregation. A common process involves designing and testing multiple crowdsourcing workflows for a task, identifying the single best-performing workflow, and then aggregating worker responses from redundant runs of that single workflow. We improve upon this process by building two control models: one that allows for switching between many workflows depending on how well a particular workflow is performing for a given example and worker; and one that can aggregate labels from tasks that do not have a finite predefined set of multiple-choice answers (e.g., counting tasks). We then implement agents that use our new models to dynamically choose whether to acquire more labels from the crowd or stop, and show that they can produce higher-quality labels at a lower cost than state-of-the-art baselines.

In the second part of the dissertation, we shift to tackle the second purpose of data: training. Because learning algorithms are often robust to noise, training sets do not necessarily have to be clean, but they have more complex requirements. We first investigate a tradeoff between size and noise. We survey how inductive bias, worker accuracy, and budget affect whether a larger and noisier training set or a smaller and cleaner one will train better classifiers. We then set up a formal framework for dynamically choosing the next example to label or relabel by generalizing active learning to allow for relabeling, which we call re-active learning, and we design new algorithms for re-active learning that outperform active learning baselines. Finally, we leave the noisy setting and investigate how to collect balanced training sets in domains of varying skew, by considering a setting in which workers can not only label examples, but also generate examples with various distributions. We design algorithms that can intelligently switch between deploying these various worker tasks depending on the skew in the dataset, and show that our algorithms can result in significantly better performance than state-of-the-art baselines.
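The idea of dynamically choosing whether to acquire another crowd label or stop can be illustrated with a small sketch. It is not the dissertation's control model: it assumes binary labels, a single known worker accuracy shared by all workers, independent errors, and a fixed confidence threshold in place of an explicit cost model. The names `posterior_positive`, `label_until_confident`, and `ask_worker` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def posterior_positive(labels: List[int], accuracy: float, prior: float = 0.5) -> float:
    """Posterior probability that the true label is 1.

    Assumes (for illustration only) that every worker is correct with the
    same probability `accuracy` and that worker errors are independent.
    """
    like_pos = prior
    like_neg = 1.0 - prior
    for y in labels:
        like_pos *= accuracy if y == 1 else 1.0 - accuracy
        like_neg *= accuracy if y == 0 else 1.0 - accuracy
    return like_pos / (like_pos + like_neg)


def label_until_confident(
    ask_worker: Callable[[], int],
    accuracy: float = 0.8,
    threshold: float = 0.95,
    max_labels: int = 9,
) -> Tuple[int, float, int]:
    """Request labels one at a time and stop once the belief is confident.

    Stops when the posterior for either class reaches `threshold`, or when
    `max_labels` labels have been collected (a stand-in for a budget).
    Returns (aggregated label, posterior for label 1, labels used).
    """
    labels: List[int] = []
    belief = 0.5
    while len(labels) < max_labels:
        labels.append(ask_worker())
        belief = posterior_positive(labels, accuracy)
        if max(belief, 1.0 - belief) >= threshold:
            break
    return (1 if belief >= 0.5 else 0), belief, len(labels)
```

Under these assumptions, an example on which workers agree is resolved after a few labels, while an example on which they disagree consumes labels until the budget runs out. A controller that also models workflows, individual workers, and label cost, as the abstract describes, would replace the fixed threshold with a decision based on expected utility.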

Record details

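  • Author

    Lin, Christopher H.

  • Author affiliation

    University of Washington

  • Degree-granting institution: University of Washington
  • Subject: Computer science; Artificial intelligence
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 175 p.
  • Total pages: 175
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification:
  • Keywords:

  • Date added to database: 2022-08-17 11:54:26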
