IEEE Transactions on Knowledge and Data Engineering

On Fault Tolerance for Distributed Iterative Dataflow Processing


Abstract

Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are part of a comprehensive workflow that includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically, and this holistic view enables them to reason about and automatically optimize the entire pipeline. However, graph and machine learning analytics are known to incur long runtimes, since they require multiple passes over the data until convergence is reached. Fault tolerance and fast recovery from intermittent failures are therefore critical for efficient analysis. In this paper, we propose novel fault-tolerance mechanisms for graph and machine learning analytics that run on distributed dataflow systems, seeking to reduce checkpointing costs and shorten failure-recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to the conventional approach to unblocking checkpointing, which manages checkpoints independently of the dataflow for immutable datasets, we inject the checkpoints of mutable datasets into the iterative dataflow itself. Our mechanism is thus iteration-aware by design, which simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, confined recovery lets us rebound rapidly: we exploit the log files that exist locally on healthy nodes and thereby avoid a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to recover quickly without introducing any checkpoints.
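The iteration-aware, unblocking checkpointing idea described in the abstract can be sketched as a small toy in Python (all names here are hypothetical illustrations, not the paper's actual Flink operators): snapshots of the mutable dataset are injected into the iteration itself and handed to a background writer, so the next superstep proceeds without waiting for the checkpoint to finish.

```python
# Toy sketch of iteration-aware, unblocking checkpointing (hypothetical
# names; not the paper's Flink implementation). A snapshot of the mutable
# per-vertex state is taken inside the iteration and written on a
# background thread, so downstream supersteps are not blocked.
import copy
import threading

def checkpoint_async(state, superstep, store):
    """Snapshot the mutable state and write it without blocking."""
    snapshot = copy.deepcopy(state)          # snapshot at the injection point
    def write():
        store[superstep] = snapshot          # stand-in for stable storage
    t = threading.Thread(target=write)
    t.start()
    return t

def iterate(state, update, supersteps, interval, store):
    writers = []
    for step in range(1, supersteps + 1):
        state = {k: update(v) for k, v in state.items()}  # one superstep
        if step % interval == 0:
            # checkpoint injected into the iterative dataflow itself
            writers.append(checkpoint_async(state, step, store))
    for t in writers:                        # drain pending writes at the end
        t.join()
    return state

store = {}
final = iterate({"v1": 0, "v2": 10}, lambda x: x + 1,
                supersteps=6, interval=2, store=store)
```

A confined recovery would then restart only the failed partition from the latest snapshot in `store`, rather than recomputing the whole iteration from scratch.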
To evaluate our fault-tolerance strategies, we conduct both a theoretical study and an experimental analysis using Apache Flink, and find that they outperform blocking checkpointing and complete recovery.
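The replica-recovery idea for machine learning can likewise be illustrated with a minimal sketch (hypothetical names; not Flink's broadcast API): because the full model is re-broadcast to every worker at each iteration, a replacement worker can resume from the broadcast replica without reading any checkpoint, and the run produces the same result as a failure-free one.

```python
# Toy sketch of replica recovery via a broadcast variable (hypothetical
# names). Each worker holds a replica of the model per iteration; on a
# crash, a replacement worker is handed that replica -- no checkpoint read.
def gradient(model, data):
    # toy gradient of sum((model - x)^2) over one worker's partition
    return sum(2.0 * (model - x) for x in data)

def train(partitions, model=0.0, iterations=20, lr=0.1, fail_at=None):
    n = sum(len(p) for p in partitions)
    for it in range(iterations):
        broadcast_model = model            # broadcast: replica on every worker
        grads = []
        for wid, data in enumerate(partitions):
            if (it, wid) == fail_at:
                # worker crashes: the replacement recovers from the
                # broadcast replica and recomputes its share
                replacement_model = broadcast_model
                grads.append(gradient(replacement_model, data))
            else:
                grads.append(gradient(broadcast_model, data))
        model -= lr * sum(grads) / n       # driver aggregates and updates
    return model

no_fail = train([[1.0, 2.0], [3.0, 4.0]])
with_fail = train([[1.0, 2.0], [3.0, 4.0]], fail_at=(5, 1))
```

Since the broadcast replica carries the full model state, the run with a simulated failure converges to exactly the same model as the failure-free run.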
