High-Performance Systems for Crowdsourced Data Analysis

Abstract

In spite of the dramatic recent progress in automated techniques for computer vision and natural language understanding, human effort, often in the form of crowd workers recruited on marketplaces such as Amazon's Mechanical Turk, remains a necessary part of data analysis workflows for machine learning and data cleaning. However, embedding manual steps in automated workflows comes with a performance cost, since humans seldom process data at the speed of computers. In order to rapidly iterate between hypotheses and evidence, data analysts need tools that can provide human processing at close to machine latencies.

In this dissertation, I describe the design, theory, and implementation of performant crowd-powered systems. After discussing the performance implications of involving humans in data analysis workflows, I present an example of a data cleaning system that requires low-latency crowd input. Then, I describe CLAMShell, a system that accurately labels large-scale datasets in one to two minutes, and its evaluation on over a thousand workers processing nearly a quarter million tasks. Next, I consider the design of multi-tenant crowd systems running many heterogeneous applications at once. I describe Cioppino, a system designed to improve throughput and reduce cost in this setting, while taking into account worker preferences. Finally, I explore the theory of identifying fast individuals in an unknown population of workers, which can be modeled as an instance of the infinite-armed bandit problem. The analysis results in novel near-optimal algorithms with applications to broader statistical theory. Together, these components provide for the implementation of human computation systems that are cost-efficient, scalable, and fast enough to integrate into existing data analysis workflows without compromising performance.


Bibliographic record

Author: Haas, Daniel
Author affiliation: University of California, Berkeley
Degree-granting institution: University of California, Berkeley
Subjects: Computer science; Information science; Artificial intelligence
Degree: Ph.D.
Year: 2017
Pages: 153
Original format: PDF
Language: English
