Venue: International Conference on High Performance Computing Simulation

A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique



Abstract

Applications that perform ultra-large-scale simulations by solving PDEs require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with system size, and this trend is likely to worsen in future machines as less reliable components are used to reduce energy cost. Thus, as systems and the problems solved on them continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially higher-dimensional problems. It can also be easily modified to provide algorithm-based fault tolerance for these problems. In this paper, we show how the SGCT can produce a fault-tolerant version of the GENE gyrokinetic plasma application, which evolves a 5D complex density field over time. We use an alternate component grid combination formula to recover data from lost processes. User Level Failure Mitigation (ULFM) MPI is used to recover the processes, and our implementation is robust over multiple failures and recoveries for both process and node failures. An acceptable degree of modification of the application is required. Results using the SGCT on two of the field's dimensions show competitive execution times with acceptable error (within 0.1%), compared to the same simulation on a single full-resolution grid. The benefits improve when the SGCT is used over three dimensions. Our experiments show that the GENE application can successfully recover from multiple process failures, and that applying the SGCT the corresponding number of times minimizes the error for the lost sub-grids. Application recovery overhead via ULFM MPI increases from ~1.5 s at 64 cores to ~5 s at 2048 cores for a one-off failure. This compares favourably to using GENE's in-built checkpointing with job restart in conjunction with the classical SGCT on failure, which has an overhead four times as large for a single failure, excluding the backtrack overhead. An analysis for a long-running application that takes checkpoint backtrack times into account indicates a reduction in overhead of over an order of magnitude.
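The abstract does not state the alternate combination formula itself. As a rough illustration of the general idea only, the sketch below computes combination coefficients for a downward-closed set of component-grid levels using the standard inclusion-exclusion rule, then recomputes them after one grid is dropped. The function names, the 2D setting, and the choice of lost grid are assumptions made for illustration and are not taken from the paper or from GENE.

```python
from itertools import product


def classical_index_set(n, dim=2):
    """Levels (each >= 1) of the classical combination technique: sum <= n + dim - 1."""
    return {
        level
        for level in product(range(1, n + 1), repeat=dim)
        if sum(level) <= n + dim - 1
    }


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of levels.

    c(l) = sum over z in {0,1}^dim of (-1)^|z| * [l + z is in the set].
    Levels whose coefficient is zero are omitted from the result.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for level in index_set:
        c = 0
        for shift in product((0, 1), repeat=dim):
            neighbour = tuple(l + s for l, s in zip(level, shift))
            if neighbour in index_set:
                c += (-1) ** sum(shift)
        if c != 0:
            coeffs[level] = c
    return coeffs


# Classical 2D case: +1 on the top diagonal, -1 on the diagonal below it.
grids = classical_index_set(4)
classical = combination_coefficients(grids)

# Hypothetical failure: the process group holding grid (2, 3) is lost.
# Dropping a top-diagonal grid keeps the set downward closed, so the
# same rule yields an alternate combination that avoids the lost grid.
lost = (2, 3)
recovered = combination_coefficients(grids - {lost})
```

In this sketch the recovered combination uses only surviving grids, including the coarser grid (1, 2) that the classical formula did not need. Its coefficients still sum to one, which is what keeps the combined solution consistent.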
