High-Performance Systems for Crowdsourced Data Analysis

Abstract

In spite of the dramatic recent progress in automated techniques for computer vision and natural language understanding, human effort, often in the form of crowd workers recruited on marketplaces such as Amazon's Mechanical Turk, remains a necessary part of data analysis workflows for machine learning and data cleaning. However, embedding manual steps in automated workflows comes with a performance cost, since humans seldom process data at the speed of computers. In order to rapidly iterate between hypotheses and evidence, data analysts need tools that can provide human processing at close to machine latencies.

In this dissertation, I describe the design, theory, and implementation of performant crowd-powered systems. After discussing the performance implications of involving humans in data analysis workflows, I present an example of a data cleaning system that requires low-latency crowd input. Then, I describe CLAMShell, a system that accurately labels large-scale datasets in one to two minutes, and its evaluation on over a thousand workers processing nearly a quarter million tasks. Next, I consider the design of multi-tenant crowd systems running many heterogeneous applications at once. I describe Cioppino, a system designed to improve throughput and reduce cost in this setting, while taking into account worker preferences. Finally, I explore the theory of identifying fast individuals in an unknown population of workers, which can be modeled as an instance of the infinite-armed bandit problem. The analysis results in novel near-optimal algorithms with applications to broader statistical theory. Together, these components provide for the implementation of human computation systems that are cost-efficient, scalable, and fast enough to integrate into existing data analysis workflows without compromising performance.

Bibliographic record

  • Author

    Haas, Daniel

  • Author affiliation

    University of California, Berkeley

  • Degree-granting institution: University of California, Berkeley
  • Subjects: Computer science; Information science; Artificial intelligence
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 153 p.
  • Total pages: 153
  • Original format: PDF
  • Language of text: English