Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers

Abstract

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determined only by the reliability requirements of the application. Efficient adaptation of this approach to DSM multicomputers is complicated by the absence of explicit messages in DSM systems, the presence of a shared and partially replicated address space, and the presence of a distributed coherency directory. We present solutions to these issues, and propose an error recovery scheme based on coordinated checkpointing and rollback for DSM multicomputers. Our performance evaluation based on trace-driven simulations indicates that this scheme incurs less checkpoint traffic than recovery schemes previously proposed for DSM systems.

Bibliographic details

Source: Reliable Distributed Systems, 1994. Proceedings., 13th Symposium on | 1994 | pp. 42-51 | 10 pages
Authors: Janakiraman, G.; Tamir, Y.
Original format: PDF
Chinese Library Classification: Radio electronics, telecommunications technology

Similar literature

1. A New Co-Ordinated Checkpointing and Rollback Recovery Scheme for Distributed Shared Memory Clusters [J]. Minakshi Tripathy, C.R. Tripathy. International Journal of Distributed and Parallel Systems. 2011, No. 1
2. Lazy garbage collection of recovery state for fault-tolerant distributed shared memory [J]. Sultan F., Nguyen T.D., Iftode L. IEEE Transactions on Parallel and Distributed Systems. 2002, No. 10
3. A Low Overhead Logging Scheme for Fast Recovery in Distributed Shared Memory Systems [J]. Taesoon Park, Heon Y. Yeom. Journal of Supercomputing. 2000, No. 3
4. Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers [C]. Janakiraman G., Tamir Y. Reliable Distributed Systems, 1994. Proceedings., 13th Symposium on
5. Application-transparent error recovery techniques for multicomputers [D]. Frazier, Tiffany Michelle. 1995
6. Performance of parallel FDTD method for shared- and distributed-memory architectures: Application to bioelectromagnetics [O]. Miguel Ruiz-Cabello N., Maksims Abaļenkovs, Luis M. Diaz Angulo, 2020
7. Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers [O]. 2008
8. Ensuring Correct Rollback Recovery in Distributed Shared Memory Systems [R]. Janssens, B., Fuchs, W. K. 1995
