Journal: IEICE Transactions on Information and Systems

Evaluation of Checkpointing Mechanism on SCore Cluster System

Abstract

Cluster systems are widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it becomes indispensable to support reliability through system software. SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides checkpointing and rollback-recovery mechanisms for high availability. In this paper, we analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore quantitatively. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long because of the low effective network bandwidth. Based on these results, we modify SCore and make checkpointing 1.8 to 2.8 times faster and recovery 3.7 to 5.0 times faster. This helps cluster systems achieve both high performance and high availability.
