Conference: IEEE International Parallel and Distributed Processing Symposium Workshops

Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

Abstract

In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and degrades performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications that share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints for minimizing the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
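For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of the classical single-application, first-order formula, T = sqrt(2 * C * mu), where C is the checkpoint cost and mu is the mean time between failures. It is not the cooperative model of the paper, which is not reproduced in this listing. The function names and all numeric values are illustrative assumptions.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost for a given period T.

    C / T is the checkpointing overhead; T / (2 * mu) is the expected
    re-execution after a failure. Valid only when T is much smaller than mu.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    cost = 60.0        # seconds to write one checkpoint with no contention
    mtbf = 86_400.0    # one failure per day on average

    period = young_daly_period(cost, mtbf)
    print(f"period: {period:.1f} s, waste: {first_order_waste(period, cost, mtbf):.4f}")

    # Assumption: contention from other applications doubles the checkpoint cost.
    contended = 2.0 * cost
    period_c = young_daly_period(contended, mtbf)
    print(f"period: {period_c:.1f} s, waste: {first_order_waste(period_c, contended, mtbf):.4f}")
```

In this simplified model the waste at the optimal period equals sqrt(2 * C / mu), so any contention that inflates the effective checkpoint cost C raises the waste of every affected application. This is one way to see why a per-application interval alone does not address contention at the platform scale.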