Considerable effort has been devoted to the design of optimized checkpointing strategies for optimistic parallel discrete event simulators. Much less work has addressed the execution mode of the individual checkpoint operation. Specifically, checkpoint operations are typically charged to the CPU, so the simulation application freezes while checkpointing is in progress; that is, the execution mode of the checkpointing protocol is typically synchronous. In this paper we focus on improving the execution mode and present a software architecture, designed for Myrinet-based Networks of Workstations (NOWs), that avoids application freezing during any checkpoint operation, thus moving the execution towards an asynchronous mode. This is done by charging checkpoint operations to a hardware component distinct from the CPU, namely a DMA engine. However, totally asynchronous checkpointing could suffer from data inconsistency whenever the content of a state buffer is accessed for further modification while a checkpoint operation involving it has not yet completed. To avoid this, the architecture includes functionalities for resynchronization on demand. We have used these functionalities to implement an execution mode of the checkpointing protocol that we refer to as semi-asynchronous. Based on the results of an experimental study, we argue that the semi-asynchronous mode can be an effective solution for almost completely removing the delay associated with any checkpoint operation from the completion time of the simulation.
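The semi-asynchronous mode described above can be illustrated with a minimal sketch. It is not the architecture presented in the paper: a Python worker thread stands in for the DMA engine, and the names `SemiAsyncCheckpointer`, `checkpoint`, `resync` and `run_demo` are hypothetical, chosen only for illustration. The sketch shows the two ingredients the abstract names: a checkpoint operation that returns without waiting for the copy, and a resynchronization on demand that is invoked before the state buffer is modified.

```python
import threading


class SemiAsyncCheckpointer:
    """Toy model: a worker thread stands in for the DMA engine (an assumption)."""

    def __init__(self):
        self._pending = None   # thread currently copying the state, if any
        self.checkpoints = []  # saved snapshots, oldest first

    def checkpoint(self, state: bytearray) -> None:
        """Start copying `state` in the background and return immediately."""
        self.resync()  # this sketch allows at most one copy in flight
        worker = threading.Thread(target=self._copy, args=(state,))
        self._pending = worker
        worker.start()

    def _copy(self, state: bytearray) -> None:
        # Snapshot the buffer; in the paper this work is charged to the DMA engine.
        self.checkpoints.append(bytes(state))

    def resync(self) -> None:
        """Block until the in-flight copy, if any, has completed."""
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def run_demo():
    cp = SemiAsyncCheckpointer()
    state = bytearray(b"event-0")
    cp.checkpoint(state)  # the copy proceeds while the simulation continues
    # Event processing that does not touch `state` could overlap with the copy here.
    cp.resync()           # resynchronize on demand before modifying the buffer
    state[-1:] = b"1"
    cp.checkpoint(state)
    cp.resync()
    return cp.checkpoints
```

Calling `resync()` only when the buffer is about to be written is what distinguishes this mode from a fully synchronous one: the application waits only if the copy has not finished by the time the state is needed again.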