Future Generation Computer Systems

A utilization model for optimization of checkpoint intervals in distributed stream processing systems

Abstract

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and checkpointing is evidently more efficient than replication as state size grows. However, current systems use a nominal value for the checkpoint interval (indicative of assuming roughly one failure every 19 days) that takes into account neither the salient aspects of the checkpoint process nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization - the fraction of total time available for the system to do useful work - that incorporates the checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology, and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing it to depend only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of the model through simulations and experiments with Apache Flink. Observations of the simulations validate our theoretical model and demonstrate that utilization can be improved using the derived optimal checkpoint interval. Moreover, experimental results with Apache Flink show that we obtain improvements in system utilization in every case we tested, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
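
The abstract does not reproduce the model itself, but its observation that the optimal interval depends only on checkpoint cost and failure rate matches the form of the classical first-order analysis (Young-style approximation, T_opt = sqrt(2C/lambda)). The sketch below illustrates that simplified model only: the utilization function, the recovery term R, and the example parameter values are illustrative assumptions, and the paper's full model additionally accounts for topology depth and message delay, which are omitted here.

```python
import math

def utilization(T, C, lam, R=0.0):
    """First-order utilization estimate for checkpoint interval T (illustrative model).

    T   : checkpoint interval (s)
    C   : cost of taking one checkpoint (s)
    lam : failure rate (failures per second)
    R   : failure detection + restart cost (s)

    Overhead per unit time: one checkpoint per interval (C/T), plus an
    expected rollback of T/2 and recovery R weighted by the failure rate.
    """
    overhead = C / T + lam * (T / 2.0 + R)
    return max(0.0, 1.0 - overhead)

def optimal_interval(C, lam):
    """Interval minimizing C/T + lam*T/2 (Young-style first-order optimum)."""
    return math.sqrt(2.0 * C / lam)

if __name__ == "__main__":
    C   = 5.0               # checkpoint cost: 5 s (assumed for illustration)
    lam = 1.0 / (6 * 3600)  # one failure every 6 hours (assumed for illustration)
    R   = 30.0              # detection + restart: 30 s (assumed for illustration)

    T_opt = optimal_interval(C, lam)
    print(f"optimal interval ~ {T_opt:.0f} s")
    for T in (60.0, T_opt, 1800.0):
        print(f"T = {T:7.0f} s -> utilization ~ {utilization(T, C, lam, R):.4f}")
```

With these illustrative numbers (a 5 s checkpoint and one failure every 6 hours), the overhead-minimizing interval comes out to roughly 465 s, and utilization drops off on either side of it, which is the qualitative behaviour the abstract's optimal-interval result describes.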