Future Generation Computer Systems

A utilization model for optimization of checkpoint intervals in distributed stream processing systems

Abstract

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and checkpointing is evidently more efficient than replication as state size grows. However, current systems use a nominal value for the checkpoint interval (indicative of assuming roughly one failure every 19 days) that takes into account neither the salient aspects of the checkpoint process nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization - the fraction of total time available for the system to do useful work - that incorporates the checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology, and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing it to depend only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of the model through simulations and experiments with Apache Flink. Observations of the simulations validate our theoretical model and demonstrate that utilization can be improved using the derived optimal checkpoint interval. Moreover, experimental results with Apache Flink show that we obtain improvements in system utilization in every case we tested, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
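
The abstract does not reproduce the model itself, but its observation that the optimal interval depends only on checkpoint cost and failure rate matches the form of the classical first-order analysis (Young-style approximation, T_opt = sqrt(2C/lambda)). The sketch below illustrates that simplified model only: the utilization function, the recovery term R, and the example parameter values are illustrative assumptions, and the paper's full model additionally accounts for topology depth and message delay, which are omitted here.

```python
import math

def utilization(T, C, lam, R=0.0):
    """First-order utilization estimate for checkpoint interval T (illustrative model).

    T   : checkpoint interval (s)
    C   : cost of taking one checkpoint (s)
    lam : failure rate (failures per second)
    R   : failure detection + restart cost (s)

    Overhead per unit time: one checkpoint per interval (C/T), plus an
    expected rollback of T/2 and recovery R weighted by the failure rate.
    """
    overhead = C / T + lam * (T / 2.0 + R)
    return max(0.0, 1.0 - overhead)

def optimal_interval(C, lam):
    """Interval minimizing C/T + lam*T/2 (Young-style first-order optimum)."""
    return math.sqrt(2.0 * C / lam)

if __name__ == "__main__":
    C   = 5.0               # checkpoint cost: 5 s (assumed for illustration)
    lam = 1.0 / (6 * 3600)  # one failure every 6 hours (assumed for illustration)
    R   = 30.0              # detection + restart: 30 s (assumed for illustration)

    T_opt = optimal_interval(C, lam)
    print(f"optimal interval ~ {T_opt:.0f} s")
    for T in (60.0, T_opt, 1800.0):
        print(f"T = {T:7.0f} s -> utilization ~ {utilization(T, C, lam, R):.4f}")
```

With these illustrative numbers (a 5 s checkpoint and one failure every 6 hours), the overhead-minimizing interval comes out to roughly 465 s, and utilization drops off on either side of it, which is the qualitative behaviour the abstract's optimal-interval result describes.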