FATM: A failure-aware adaptive fault tolerance model for distributed stream processing systems

Akber Syed Muhammad Abrar; Chen Hanhua; Jin Hai


Abstract

Distributed stream processing systems (DSPS) are widely used to process unbounded data streams in real time. Low processing latency is a fundamental requirement for DSPS applications to maintain real-time response. This requirement is badly affected by the inevitable failures in computing systems. Generally, DSPS cope with these failures by triggering periodic checkpoints. Periodic checkpoints pessimistically persist the application state so that execution can be resumed after a failure. These checkpoints incur high overheads because they are triggered at high frequency, which increases the overall execution time. Failure occurrences in real-world systems, on the other hand, are not periodic. This sharp contrast between periodic checkpoints and the failure distributions of real-world systems makes periodic checkpoints inefficient. We propose a failure-aware adaptive fault tolerance model called FATM, which triggers checkpoints in line with the underlying failure rate. Further, we design a model for the utility factor and checkpoint overheads to evaluate the performance of fault tolerance models for DSPS. We implement FATM atop Apache Flink and perform a series of experiments. To validate the effectiveness of FATM, the experimental results are compared with the existing checkpoint-based models of DSPS. The results show that FATM significantly reduces the checkpoint frequency, increases the utility factor, and reduces the checkpoint overheads by 28%.
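To make the idea of tying the checkpoint interval to the failure rate concrete, the following is a minimal sketch and not the FATM algorithm, whose details the abstract does not give. It assumes Young's classical first-order approximation, in which the interval is sqrt(2 * C * MTBF) for a checkpoint cost C and a mean time between failures MTBF. The function names and the simple overhead measure C / (T + C) are hypothetical illustrations, not the paper's utility factor or overhead model.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Checkpoint interval from Young's approximation: sqrt(2 * C * MTBF).

    Assumption: this classical formula stands in for a failure-aware policy;
    it is not taken from the FATM paper.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead_fraction(checkpoint_cost_s: float, interval_s: float) -> float:
    """Hypothetical overhead measure: share of time spent writing checkpoints."""
    if checkpoint_cost_s < 0 or interval_s <= 0:
        raise ValueError("cost must be non-negative and interval positive")
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


if __name__ == "__main__":
    cost = 2.0  # seconds per checkpoint (illustrative value)
    for mtbf in (3600.0, 14400.0):  # one failure per hour vs. one per four hours
        interval = young_interval(cost, mtbf)
        overhead = checkpoint_overhead_fraction(cost, interval)
        print(f"MTBF={mtbf:.0f}s interval={interval:.0f}s overhead={overhead:.4f}")
```

Under these assumptions, a lower failure rate (larger MTBF) yields a longer interval and therefore fewer checkpoints, which is the qualitative behaviour the abstract attributes to a failure-aware policy; a fixed periodic interval cannot follow such changes.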


Bibliographic information

Source
Concurrency and Computation: Practice and Experience, 2021, Issue 10, e6167.1-e6167.22 (22 pages)
Authors
Akber Syed Muhammad Abrar; Chen Hanhua; Jin Hai
Author affiliation (identical for all three authors)

Huazhong Univ Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Serv Comp Technol & Syst Lab, Sch Comp Sci & Technol, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
Format: PDF
Language: English
Keywords
checkpoints; distributed stream processing; failure prediction; fault tolerance; resilience;
