
A survey of checkpointing algorithms for parallel and distributed computers


Abstract

A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving that status information. This paper surveys the algorithms reported in the literature for checkpointing parallel/distributed systems. Most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area that relax the assumptions made in that article and extend it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. These, however, also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that nothing is known about the programs being executed. In developing parallel programs, however, the user has to do a fair amount of work in distributing tasks, and this information can be used effectively to simplify checkpointing and rollback recovery.
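The definition of a checkpoint given in the abstract can be illustrated with a minimal single-process sketch. It is not one of the surveyed algorithms and involves no coordination between processes; it only shows status information being saved at designated points and processing being resumed from the last saved state after a failure. The names `save_checkpoint`, `load_checkpoint`, and `run`, the use of a local file as "stable storage", and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save state to path (hypothetical helper).

    The state is written to a temporary file first and then renamed over
    the target, so a crash during the write leaves the old checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n, interval, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    `fail_at` is an illustrative parameter that simulates a crash just
    before the given iteration is processed.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        # The checkpoint: normal processing pauses to save the status.
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same `path` after a failure rolls back to the most recent checkpoint and repeats only the work done since then. In the message passing and shared memory settings the survey covers, the additional difficulty is making the states saved by different processes mutually consistent, which this sketch does not attempt.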

Bibliographic details

  • Authors

    Kalaiselvi S; Rajaraman V

  • Year: 2000
  • Original format: PDF
