
Extending a cluster SSI OS for transparently checkpointing message-passing parallel applications


Abstract

Clusters are now widely used to run scientific applications, which are often message-passing parallel applications with long execution times. As the number of nodes in a cluster grows, faults become more frequent, so an application's execution time may exceed the cluster's mean time between failures (MTBF). To avoid restarting an application from the beginning, cluster systems should provide fault-tolerance mechanisms such as checkpoint/restart. One efficient approach is to implement this mechanism directly in the application or in the communication library. Another is to implement it in an operating system dedicated to clusters. This is more complex, but it allows any message-passing application to be checkpointed and restarted, whatever communication library it uses. This paper presents basic mechanisms for system-initiated checkpointing of message-passing parallel applications running on a cluster. Performance results are presented from a prototype implemented in KERRIGHED, a Single System Image cluster operating system based on LINUX.
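
The claim that a larger cluster fails more often can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: it assumes nodes fail independently with exponentially distributed failure times, so that failure rates add and the cluster MTBF is the per-node MTBF divided by the node count. The function names and the numbers in the usage lines are hypothetical.

```python
def cluster_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """Estimate the MTBF of a whole cluster.

    Assumption (not from the paper): nodes fail independently with
    exponentially distributed failure times, so failure rates add up.
    """
    if node_mtbf_hours <= 0 or n_nodes <= 0:
        raise ValueError("node_mtbf_hours and n_nodes must be positive")
    return node_mtbf_hours / n_nodes


def run_exceeds_mtbf(run_hours: float, node_mtbf_hours: float, n_nodes: int) -> bool:
    """Return True when a job is expected to outlast the cluster MTBF."""
    return run_hours > cluster_mtbf_hours(node_mtbf_hours, n_nodes)


# Hypothetical figures: 50,000-hour node MTBF and a 72-hour job.
small = cluster_mtbf_hours(50_000, 100)    # cluster MTBF with 100 nodes
large = cluster_mtbf_hours(50_000, 1_000)  # cluster MTBF with 1,000 nodes
needs_checkpointing = run_exceeds_mtbf(72, 50_000, 1_000)
```

Under these assumed figures, growing the cluster from 100 to 1,000 nodes cuts the estimated MTBF by a factor of ten, which is the situation in which a long job is unlikely to finish without checkpoint/restart.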
