Conference: Exascale MPI Workshop; International Conference for High Performance Computing, Networking, Storage and Analysis

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Abstract

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks, which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single-node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors, where our runtime manages all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead by offloading communication to dedicated communication workers, and to recover from fail-stop errors transparently, thereby enhancing productivity.
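
The two ideas named in the abstract, offloading communication to a dedicated worker and re-executing a failed task without user intervention, can be illustrated with a minimal sketch. This is not the paper's runtime or its API. `CommWorker`, `post_send`, `run_with_replay`, and the `transport` callable are hypothetical names introduced here, a plain Python thread stands in for a communication worker, an ordinary callable stands in for an MPI send, and a raised exception stands in for a failure (a real fail-stop error is the loss of a process, which this sketch does not model). For simplicity the replay helper waits for each send to complete before returning.

```python
import queue
import threading
from concurrent.futures import Future


class CommWorker:
    """Dedicated thread that drains a queue of communication requests.

    Hypothetical stand-in for a communication worker: `transport` is any
    callable that performs the send (an MPI call in a real runtime).
    """

    def __init__(self, transport):
        self._transport = transport
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post_send(self, dest, payload):
        """Enqueue a send and return immediately with a Future."""
        done = Future()
        self._requests.put((dest, payload, done))
        return done

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            dest, payload, done = item
            try:
                done.set_result(self._transport(dest, payload))
            except Exception as exc:
                done.set_exception(exc)

    def shutdown(self):
        """Stop the worker after it has processed all queued requests."""
        self._requests.put(None)
        self._thread.join()


def run_with_replay(task, comm, dest, max_attempts=3):
    """Re-execute `task` and its send until both succeed or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            # Wait for the offloaded send so a failed send also triggers replay.
            comm.post_send(dest, result).result()
            return result, attempt
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

A caller would construct one `CommWorker` per process, submit tasks through `run_with_replay`, and call `shutdown()` at the end. Compute threads hand sends to the worker through the queue and never call the transport themselves, which is the offloading pattern the abstract describes.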