IEEE Transactions on Knowledge and Data Engineering

On Fault Tolerance for Distributed Iterative Dataflow Processing


Abstract

Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are part of a comprehensive workflow that includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically, and this holistic view enables them to reason about and automatically optimize the entire pipeline. However, graph and machine learning analytics are known to incur long runtimes, since they require multiple passes over the data until convergence is reached. Fault tolerance and fast recovery from intermittent failures are therefore critical for efficient analysis. In this paper, we propose novel fault-tolerance mechanisms for graph and machine learning analytics that run on distributed dataflow systems, seeking to reduce checkpointing costs and shorten failure-recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to the conventional approach to unblocking checkpointing, which manages checkpoints independently of the dataflow for immutable datasets, we inject the checkpoints of mutable datasets into the iterative dataflow itself. Our mechanism is thus iteration-aware by design, which simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, confined recovery lets us rebound rapidly: we exploit the log files that exist locally on healthy nodes and thereby avoid a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to recover quickly without introducing any checkpoints.
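The iteration-aware, unblocking checkpointing idea described in the abstract can be sketched as a small toy in Python (all names here are hypothetical illustrations, not the paper's actual Flink operators): snapshots of the mutable dataset are injected into the iteration itself and handed to a background writer, so the next superstep proceeds without waiting for the checkpoint to finish.

```python
# Toy sketch of iteration-aware, unblocking checkpointing (hypothetical
# names; not the paper's Flink implementation). A snapshot of the mutable
# per-vertex state is taken inside the iteration and written on a
# background thread, so downstream supersteps are not blocked.
import copy
import threading

def checkpoint_async(state, superstep, store):
    """Snapshot the mutable state and write it without blocking."""
    snapshot = copy.deepcopy(state)          # snapshot at the injection point
    def write():
        store[superstep] = snapshot          # stand-in for stable storage
    t = threading.Thread(target=write)
    t.start()
    return t

def iterate(state, update, supersteps, interval, store):
    writers = []
    for step in range(1, supersteps + 1):
        state = {k: update(v) for k, v in state.items()}  # one superstep
        if step % interval == 0:
            # checkpoint injected into the iterative dataflow itself
            writers.append(checkpoint_async(state, step, store))
    for t in writers:                        # drain pending writes at the end
        t.join()
    return state

store = {}
final = iterate({"v1": 0, "v2": 10}, lambda x: x + 1,
                supersteps=6, interval=2, store=store)
```

A confined recovery would then restart only the failed partition from the latest snapshot in `store`, rather than recomputing the whole iteration from scratch.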
To evaluate our fault-tolerance strategies, we conduct both a theoretical study and an experimental analysis using Apache Flink, and find that they outperform blocking checkpointing and complete recovery.
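The replica-recovery idea for machine learning can likewise be illustrated with a minimal sketch (hypothetical names; not Flink's broadcast API): because the full model is re-broadcast to every worker at each iteration, a replacement worker can resume from the broadcast replica without reading any checkpoint, and the run produces the same result as a failure-free one.

```python
# Toy sketch of replica recovery via a broadcast variable (hypothetical
# names). Each worker holds a replica of the model per iteration; on a
# crash, a replacement worker is handed that replica -- no checkpoint read.
def gradient(model, data):
    # toy gradient of sum((model - x)^2) over one worker's partition
    return sum(2.0 * (model - x) for x in data)

def train(partitions, model=0.0, iterations=20, lr=0.1, fail_at=None):
    n = sum(len(p) for p in partitions)
    for it in range(iterations):
        broadcast_model = model            # broadcast: replica on every worker
        grads = []
        for wid, data in enumerate(partitions):
            if (it, wid) == fail_at:
                # worker crashes: the replacement recovers from the
                # broadcast replica and recomputes its share
                replacement_model = broadcast_model
                grads.append(gradient(replacement_model, data))
            else:
                grads.append(gradient(broadcast_model, data))
        model -= lr * sum(grads) / n       # driver aggregates and updates
    return model

no_fail = train([[1.0, 2.0], [3.0, 4.0]])
with_fail = train([[1.0, 2.0], [3.0, 4.0]], fail_at=(5, 1))
```

Since the broadcast replica carries the full model state, the run with a simulated failure converges to exactly the same model as the failure-free run.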
