
Modeling and Optimization of Nonblocking Checkpointing for Optimistic Simulation on Myrinet Clusters


Abstract

The Checkpointing-and-Communication Library (CCL) is a recently developed software package that implements CPU-offloaded checkpointing in support of optimistic parallel simulation on Myrinet clusters. Specifically, CCL implements a non-blocking execution mode for the memory-to-memory data copies associated with checkpoint operations, based on the data transfer capabilities of a programmable DMA engine on board Myrinet network cards. CPU and DMA activities must sometimes be re-synchronized, for example to maintain data consistency, which adds some overhead to non-blocking checkpoint operations that would otherwise be free of CPU cost. In this paper we present a detailed cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantic, which we call minimum cost re-synchronization (MC). With this semantic, each occurrence of re-synchronization either commits the ongoing DMA-based checkpoint operation (suspending CPU activities) or aborts it (possibly increasing the expected rollback cost, because fewer checkpoints are committed), whichever has the minimum expected overhead as evaluated through the cost model. We discuss viable techniques for solving the cost model, then present the implementation of MC that we have developed within the CCL framework. As we will show, this implementation relies on solutions we introduce to estimate or determine the values of low-level system parameters (e.g. the residual completion time of DMA operations). The paper also reports experimental results demonstrating the performance benefits of this optimized re-synchronization semantic, in terms of increased execution speed, for a Personal Communication System (PCS) simulation application, selected as a testbed among real-world simulation problems.
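
The commit-or-abort choice described in the abstract can be illustrated with a minimal sketch. This is not the paper's cost model or the CCL API: the names `ResyncCosts`, `residual_dma_time`, `extra_rollback_cost`, and `mc_decision` are hypothetical, and the two cost terms are simplifying assumptions (committing is assumed to cost the CPU the remaining DMA transfer time; aborting is assumed to cost the expected increase in rollback time from losing that checkpoint).

```python
from dataclasses import dataclass


@dataclass
class ResyncCosts:
    # Hypothetical inputs, not taken from the paper or from CCL.
    residual_dma_time: float    # estimated time left for the in-flight DMA copy
    extra_rollback_cost: float  # expected rollback-cost increase if the checkpoint is dropped


def mc_decision(costs: ResyncCosts) -> str:
    """Pick the re-synchronization action with the lower expected overhead.

    Assumption: committing suspends the CPU for the residual DMA time,
    while aborting adds the expected extra rollback cost.
    """
    commit_cost = costs.residual_dma_time
    abort_cost = costs.extra_rollback_cost
    return "commit" if commit_cost <= abort_cost else "abort"


if __name__ == "__main__":
    # DMA copy is nearly finished: waiting is cheaper than losing the checkpoint.
    print(mc_decision(ResyncCosts(residual_dma_time=5.0, extra_rollback_cost=20.0)))
    # DMA copy has a long way to go: dropping the checkpoint is cheaper.
    print(mc_decision(ResyncCosts(residual_dma_time=50.0, extra_rollback_cost=20.0)))
```

In this simplified form the decision reduces to a single comparison; the paper's cost model determines how the two expectations are actually evaluated.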

Bibliographic details

  • Authors: F. Quaglia; A. Santoro
  • Year: 2005
  • Original format: PDF
