Checkpointing a multithreaded distributed shared memory computer system.

Abstract

Distributing a program over a cluster of commodity processors connected by a commodity network can speed up a computation at relatively low cost. Distributed cluster computing is especially useful for long-running scientific applications. As the number of processors and the running time of a program increase, however, so does the probability that one of the system's components will fail before the program ends. A program can prepare for failures by periodically saving its state in a checkpoint from which it can be recovered later.

Checkpointing distributed programs requires making sure that the checkpoints saved by individual processes can be used together to restore a consistent state. Programs using a coordinated checkpointing algorithm communicate to save a consistent state. Programs using a communication-induced checkpointing algorithm build a consistent state without explicit communication. Although communication-induced checkpointing algorithms have less communication overhead, they do not add significantly less overhead to programs, because synchronization overhead is small compared to the time required to save a checkpoint to disk.

A checkpointing system builds consistent global checkpoints from checkpoints of individual processes. Each Unify process has multiple threads, but at the start of this research no checkpointing library existed that could checkpoint multithreaded programs. This research includes the development of a checkpointing library to checkpoint multithreaded processes on Solaris 2.5 and Linux. The library can be used by Unify or as a standalone checkpointing library for multithreaded processes.
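The idea of coordinated checkpointing described in the abstract can be illustrated with a minimal sketch. This is not the thesis's library or the algorithm Unify uses; it is a single-process analogy in Python, assuming worker threads stand in for distributed processes and a `threading.Barrier` stands in for the coordination messages. The names `save_checkpoint`, `load_checkpoint`, and `run_workers` are hypothetical.

```python
import os
import pickle
import tempfile
import threading


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_workers(path, num_workers, total_steps, checkpoint_every, start_state=None):
    """Run workers in lockstep and checkpoint only when all of them are paused."""
    state = start_state or {"step": 0, "counters": [0] * num_workers}
    first_step = state["step"]

    def end_of_step():
        # Barrier action: runs in one thread while every other worker is still
        # blocked in wait(), so the saved state is consistent across workers.
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)

    barrier = threading.Barrier(num_workers, action=end_of_step)

    def worker(index):
        for _ in range(first_step, total_steps):
            state["counters"][index] += index + 1  # stand-in for real work
            barrier.wait()  # coordinate before anyone starts the next step

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

To recover after a failure, a program would call `load_checkpoint` and pass the result as `start_state`; the workers then resume from the last step at which every worker had been paused. The barrier makes the cost trade-off visible: the synchronization itself is cheap, while the disk write in `save_checkpoint` dominates, which matches the abstract's observation about where checkpointing overhead comes from.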