Conference: International Conference on Very Large Data Bases

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Abstract

Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems. We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost. In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.