Journal of Computational Science

A fault-tolerant HPC scheduler extension for large and operational ensemble data assimilation: Application to the Red Sea


Abstract

A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea, based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and on the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members collapsing. This can happen when the filter update step imposes large corrections on one or more of the forecasted ensemble members that are not fully consistent with the model physics. Increasing the ensemble size is expected to improve the performance of the assimilation system, but it also increases the risk of member collapse. Hardware failures and slow numerical convergence for some members are also expected to occur more frequently. In this context, manually steering the whole process is a real challenge and makes implementing the ensemble assimilation procedure difficult and extremely time consuming.

This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. The system is built on top of Decimate, a scheduler extension developed to ease the submission, monitoring, and dynamic steering of workflows of dependent jobs in a fault-tolerant environment. We describe the implementation of the assimilation system and discuss its coupling strategies in detail. Within Decimate, only a few additional lines of Python are needed to define flexible convergence criteria and to apply any necessary actions to the forecast ensemble members, for instance (i) restarting a faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, and (iv) replacing members on the fly to enrich the ensemble with new members.

We demonstrate the efficiency of the system with numerical experiments assimilating real satellite sea surface height and temperature observations in the Red Sea.
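
The kind of per-member decision logic listed in (i), (ii), and (iv) can be illustrated with a short, self-contained Python sketch. It does not use Decimate's API, which the abstract does not show. The names `MemberStatus`, `Action`, and `decide`, the `residual` convergence measure, and the `tol` and `max_attempts` thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    KEEP = auto()      # member is healthy, use its forecast
    RESTART = auto()   # (i) rerun the failed job unchanged
    RESEED = auto()    # (ii) rerun with a new random seed
    REPLACE = auto()   # (iv) drop the member and draw a new one


@dataclass
class MemberStatus:
    """Hypothetical summary of one forecast member after a cycle."""
    member_id: int
    exit_code: int    # 0 means the forecast job ran to completion
    residual: float   # assumed convergence measure (smaller is better)
    attempts: int     # how many times this member has been run so far


def decide(status, tol=1e-6, max_attempts=3):
    """Map a member's status to a steering action (illustrative policy)."""
    failed = status.exit_code != 0
    # A NaN residual compares False here, so it counts as not converged.
    converged = (not failed) and status.residual <= tol
    if converged:
        return Action.KEEP
    if status.attempts >= max_attempts:
        return Action.REPLACE
    if failed:
        return Action.RESTART
    return Action.RESEED
```

Under these assumptions, a scheduler layer would call `decide` once per member after each forecast step and then resubmit, reseed, or replace jobs accordingly. Item (iii), resizing the number of parallel forecasts, would be a separate ensemble-level decision and is not covered here.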
