File I/O for MPI Applications in Redundant Execution Scenarios

Abstract

Multi-petascale and exascale high-performance computing (HPC) systems inevitably have to deal with a number of resilience challenges, such as a significant growth in component count and smaller circuit sizes with lower circuit voltages. Redundancy may offer an acceptable level of resilience that traditional fault tolerance techniques, such as checkpoint/restart, do not. Although redundancy in HPC is quite controversial due to the associated cost for redundant components, the constantly increasing number of cores per processor is tilting this cost calculation toward a system design where computation, such as that needed for redundancy, is much cheaper and communication, needed for checkpoint/restart, is much more expensive. Recent research and development activities in redundancy for Message Passing Interface (MPI) applications have focused on availability/reliability models and replication algorithms. This paper takes a first step toward solving an open research problem associated with running a parallel application redundantly: file I/O under redundancy. The approach intercepts file I/O calls made by a redundant application and employs coordination protocols that execute file I/O operations in a redundancy-oblivious fashion when accessing a node-local file system, or in a redundancy-aware fashion when accessing a shared networked file system. A proof-of-concept prototype is presented, and a number of coordination protocols are described and evaluated. The results show the performance impact of redundantly accessing a shared networked file system, but also demonstrate the capability to regain performance by utilizing MPI communication between replicas and parallel file I/O.
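
The distinction between redundancy-oblivious and redundancy-aware file I/O can be illustrated with a small simulation. The sketch below is not the paper's prototype or any of its coordination protocols: the rank layout in `map_replica`, the "replica 0 writes" policy, and every name in the code are hypothetical choices made only for illustration, and an in-memory dictionary stands in for the file systems.

```python
from dataclasses import dataclass
from enum import Enum


class FsKind(Enum):
    NODE_LOCAL = "node-local"   # each replica has its own private file system
    SHARED = "shared"           # all replicas see the same networked file system


@dataclass(frozen=True)
class Replica:
    physical_rank: int   # rank of the actual process
    virtual_rank: int    # rank the application believes it has
    replica_id: int      # which copy of that virtual rank this process is


def map_replica(physical_rank: int, num_virtual_ranks: int) -> Replica:
    # Assumed layout (hypothetical): replicas occupy consecutive blocks of ranks.
    return Replica(
        physical_rank,
        physical_rank % num_virtual_ranks,
        physical_rank // num_virtual_ranks,
    )


def should_write(replica: Replica, fs: FsKind) -> bool:
    # Redundancy-oblivious: on a node-local file system every replica writes.
    if fs is FsKind.NODE_LOCAL:
        return True
    # Redundancy-aware (assumed policy): on a shared file system only
    # replica 0 of each virtual rank performs the write.
    return replica.replica_id == 0


def coordinated_write(replica: Replica, fs: FsKind, path: str,
                      data: bytes, storage: dict) -> int:
    """Stand-in for an intercepted write call.

    `storage` maps (location, path) to a list of written buffers.
    """
    if should_write(replica, fs):
        location = (f"node{replica.physical_rank}"
                    if fs is FsKind.NODE_LOCAL else "shared")
        storage.setdefault((location, path), []).append(data)
    # Every replica reports the same result so that replicas stay consistent.
    # A real protocol would obtain the writer's actual result, for example
    # through an MPI message; here success is simply assumed.
    return len(data)
```

With two virtual ranks and two replicas each (four processes), calling `coordinated_write` from every process produces two writes to the shared location under `FsKind.SHARED`, and one write per process under `FsKind.NODE_LOCAL`.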