Conference: Reliable Distributed Systems, 2003. Proceedings. 22nd International Symposium on

Raptor: integrating checkpoints and thread migration for cluster management

Abstract

Software distributed shared memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. However, problems such as cluster component reliability and cluster management, which are not directly related to performance, need to be addressed before SDSM solutions can be widely adopted. This paper presents Raptor, an SDSM cluster management system based on checkpoint/recovery and thread migration. Raptor checkpoints decouple the runtime system and application data from application threads, allowing efficient load balancing, resource allocation, and rollback recovery. The system has two important features. First, it reduces checkpoint overhead by saving only application-specific data that cannot be recreated at recovery time. Second, by integrating thread migration both during execution and at recovery, it allows computing resources to be added to or removed from a running application, while placing little or no additional burden on the SDSM application programmer.
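
The two features described in the abstract can be illustrated with a minimal sketch. This is not Raptor's implementation or API; the names `WorkerState`, `take_checkpoint`, `recover`, and `run` are hypothetical, and a real SDSM runtime would checkpoint shared pages and thread contexts rather than Python objects. The sketch only shows the idea of saving the state that cannot be recreated, and of reassigning threads to whatever nodes are available at recovery time.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class WorkerState:
    """Hypothetical per-thread state; not Raptor's actual data layout."""
    thread_id: int
    next_index: int   # essential: progress cannot be recreated
    partial_sum: int  # essential: result accumulated so far
    # Derived scratch data: can be rebuilt from the input, so it is never saved.
    cache: dict = field(default_factory=dict, repr=False)


def take_checkpoint(workers):
    """Serialize only the fields that cannot be recreated at recovery time."""
    essential = [(w.thread_id, w.next_index, w.partial_sum) for w in workers]
    return pickle.dumps(essential)


def recover(blob, nodes):
    """Rebuild worker state and place each thread on a currently available node."""
    workers = [WorkerState(tid, idx, acc) for tid, idx, acc in pickle.loads(blob)]
    # Round-robin placement: the node set may differ from the one used
    # when the checkpoint was taken (an assumption of this sketch).
    placement = {w.thread_id: nodes[i % len(nodes)] for i, w in enumerate(workers)}
    return workers, placement


def run(worker, data, stop):
    """Advance one worker over the data up to (but not including) index `stop`."""
    while worker.next_index < stop:
        value = data[worker.next_index]
        worker.cache[worker.next_index] = value * value  # recreatable scratch data
        worker.partial_sum += value
        worker.next_index += 1
```

In this sketch, two workers could each process part of a list, be checkpointed, and then be recovered onto a single node to finish the remaining work. Because the thread state is decoupled from the node it ran on, the same mechanism covers both rollback recovery and changing the number of nodes.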
