Journal: Concurrency and Computation: Practice and Experience

FATM: A failure-aware adaptive fault tolerance model for distributed stream processing systems


Abstract

Distributed Stream Processing Systems (DSPS) are widely used to process unbounded data streams in real time. Low processing latency is a fundamental requirement for DSPS applications to maintain real-time response, and this requirement is badly affected by the inevitable failures in computing systems. DSPS generally cope with these failures by triggering periodic checkpoints, which pessimistically persist the application state so that execution can resume after a failure. Because they are triggered frequently, periodic checkpoints incur high overheads that increase the overall execution time. Failures in real-world systems, however, are not periodic, and this mismatch between periodic checkpoints and real-world failure distributions makes periodic checkpointing inefficient. We propose a failure-aware adaptive fault tolerance model called FATM, which triggers checkpoints in line with the underlying failure rate. Further, we design a model of the utility factor and checkpoint overheads to evaluate the performance of fault tolerance models for DSPS. We implement FATM atop Apache Flink and perform a series of experiments. To validate the effectiveness of FATM, the experimental results are compared with the existing checkpoint-based models of DSPS. The results show that FATM significantly reduces the checkpoint frequency, increases the utility factor, and reduces the checkpoint overheads by 28%.
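
The abstract does not spell out how FATM maps the failure rate to a checkpoint schedule, so the following is only an illustrative sketch of the general idea of failure-rate-driven checkpointing, not the authors' method. It assumes Young's classical first-order approximation (interval ≈ sqrt(2 × checkpoint cost × mean time between failures)) together with an exponentially weighted estimate of the mean time between failures. All names (`young_interval`, `AdaptiveCheckpointInterval`, `alpha`, the clamping bounds) are hypothetical, and wiring the result into Apache Flink is left out.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of a checkpoint interval (seconds).

    This is a stand-in technique for illustration, not the FATM model.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class AdaptiveCheckpointInterval:
    """Keeps a smoothed MTBF estimate and derives a checkpoint interval from it.

    Hypothetical helper: the smoothing factor and clamping bounds are
    assumptions chosen for the example, not values from the paper.
    """

    def __init__(self, checkpoint_cost_s, initial_mtbf_s, alpha=0.5,
                 min_interval_s=1.0, max_interval_s=3600.0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.checkpoint_cost_s = checkpoint_cost_s
        self.mtbf_s = initial_mtbf_s
        self.alpha = alpha
        self.min_interval_s = min_interval_s
        self.max_interval_s = max_interval_s

    def record_failure(self, time_since_last_failure_s: float) -> None:
        """Fold a newly observed inter-failure time into the MTBF estimate."""
        self.mtbf_s = (self.alpha * time_since_last_failure_s
                       + (1.0 - self.alpha) * self.mtbf_s)

    def interval_s(self) -> float:
        """Current checkpoint interval, clamped to the configured bounds."""
        raw = young_interval(self.checkpoint_cost_s, self.mtbf_s)
        return min(max(raw, self.min_interval_s), self.max_interval_s)
```

In this sketch, a long gap between failures raises the estimated mean time between failures and stretches the interval (fewer checkpoints), while a burst of failures shrinks it. That is the qualitative behaviour the abstract contrasts with fixed periodic checkpoints.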