Published in: International Conference for High Performance Computing, Networking, Storage and Analysis

Understanding the Effects of Communication and Coordination on Checkpointing at Scale


Abstract

Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how introducing asynchrony can address anticipated scalability issues. However, few insights have emerged into how to select and tune these protocols for applications at scale. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that although much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints cause process delays that can propagate through messaging relations to other processes, producing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols, which were designed to reduce log volumes, so that they also significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing, and it enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
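
The cascading-delay effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's simulator: the ring topology, the function name `simulate`, and all parameter values are assumptions chosen only for illustration. Each process does a fixed amount of work per iteration and cannot start the next iteration until both ring neighbors have finished the current one. The staggered schedule is deliberately adversarial, with checkpoints timed to follow the delay wave around the ring.

```python
def simulate(num_procs, num_iters, work, ckpt_cost, interval, offsets):
    """Toy model (an assumption, not the paper's simulator).

    Processes sit on a ring. In every iteration a process waits for itself
    and both neighbors to finish the previous iteration, then computes for
    `work` time units. Process p also pays `ckpt_cost` in iterations t where
    t % interval == offsets[p] % interval.
    Returns the time at which the slowest process finishes.
    """
    finish = [0.0] * num_procs
    for t in range(num_iters):
        new_finish = []
        for p in range(num_procs):
            left = finish[(p - 1) % num_procs]
            right = finish[(p + 1) % num_procs]
            # A process cannot proceed until its messaging partners have.
            start = max(finish[p], left, right)
            cost = work
            if t % interval == offsets[p] % interval:
                cost += ckpt_cost  # local checkpoint delays this process
            new_finish.append(start + cost)
        finish = new_finish
    return max(finish)


NUM_PROCS, NUM_ITERS, WORK, CKPT, INTERVAL = 8, 64, 1.0, 0.5, 8

# Coordinated: every process checkpoints in the same iteration.
coordinated = simulate(NUM_PROCS, NUM_ITERS, WORK, CKPT, INTERVAL,
                       offsets=[0] * NUM_PROCS)

# Uncoordinated, adversarially staggered: process p checkpoints one
# iteration after process p - 1, so each delay lands on the critical path.
staggered = simulate(NUM_PROCS, NUM_ITERS, WORK, CKPT, INTERVAL,
                     offsets=list(range(NUM_PROCS)))

print("coordinated:", coordinated)
print("staggered:  ", staggered)
```

In this toy setting every process takes the same number of checkpoints under both schedules, yet the staggered schedule finishes later because each local checkpoint delays a neighbor that is about to checkpoint itself. Real applications have different communication patterns, so the size of the effect here says nothing about the paper's measured results.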
