Scalable Group-based Checkpoint/Restart for Large-Scale Message-passing Systems

Abstract

The ever-increasing number of processors used in parallel computers is making fault-tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. We then propose a group-based solution that combines coordinated checkpointing and message logging. Experimental results demonstrate that it performs and scales better than LAM/MPI and MPICH-VCL. To assist group formation, we also propose a method for analyzing the communication behavior of the application.

Bibliographic details

Source
International Symposium on Parallel and Distributed Processing | 2008 | 12 pages
Authors
Justin C. Y. Ho; Cho-Li Wang; Francis C. M. Lau
Original format: PDF
Chinese Library Classification: TP311.138-53

Similar literature

1. Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization [J]. Chen Cheng, Du Yunfei, Zuo Ke. Journal of Supercomputing. 2019, No. 8.
2. Stabilizing Large-scale Generalized Systems On Parallel Computers Using Multithreading And Message-passing [J]. Peter Benner, Maribel Castillo, Rafael Mayo. Concurrency and Computation. 2007, No. 4.
3. Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems [J]. Yi-Min Wang, Pi-Yu Chung. IEEE Transactions on Parallel and Distributed Systems. 1995, No. 5.
4. Scalable Group-based Checkpoint/Restart for Large-Scale Message-passing Systems [C]. Justin C. Y. Ho, Cho-Li Wang, Francis C. M. Lau. International Symposium on Parallel and Distributed Processing. 2008.
5. Extending the Domain of Transparent Checkpoint-Restart for Large-Scale HPC [D]. Garg, Rohan. 2019.
6. Large-Scale Clinical Systems. Impact of Large-Scale Systems on Clinical Practice: The Impact of the HELP Computer System on the LDS Hospital Paper Medical Record [O]. Gilad J. Kuperman, Reed M. Gardner. 1990.
7. Scalable group-based checkpoint/restart for large-scale message-passing systems [O]. Ho JCY, Lau FCM, Wang CL. 2008.
