Journal of Parallel and Distributed Computing

Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems


Abstract

Non-volatile devices, such as SSDs, will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can sit on the compute nodes as part of a distributed burst buffer service, or they can be external. Wherever they are located in the hierarchy, one critical design issue is SSD endurance under write-heavy workloads, such as the checkpoint I/O of scientific applications. For these environments, it is widely assumed that checkpoint operations occur once every 60 min and that each checkpoint step can write out as much as half of the system memory. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can wear out much more quickly, given the large amount of data written at every checkpoint step. One possible solution is to limit the amount of data written by reducing the checkpoint frequency. However, a direct effect of a reduced checkpoint frequency is a longer window of vulnerability to system failures and therefore potentially more wasted computation time, especially for large-scale compute jobs. In this paper, we propose a new checkpoint placement optimization model that uses the burst buffer and the parallel file system together to store the checkpoints, with the design goals of maximizing computation efficiency while guaranteeing the SSD endurance requirements. Moreover, we present an adaptive algorithm that dynamically adjusts the checkpoint placement based on the system's runtime characteristics and continuously optimizes the burst buffer utilization. The evaluation results show that our adaptive checkpoint placement algorithm can guarantee the burst buffer endurance with at most 5% performance degradation per application and less than 3% for the entire system.
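
To make the trade-off concrete, the following is a minimal sketch of an endurance-aware placement rule. It is not the optimization model or the adaptive algorithm proposed in the paper. It assumes a simple greedy policy: a checkpoint goes to the burst buffer while a daily write budget lasts, and to the parallel file system otherwise. The budget is assumed to be capacity times a drive-writes-per-day (DWPD) rating, and all names (`BurstBuffer`, `place_checkpoint`, `simulate_day`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BurstBuffer:
    """Hypothetical burst buffer with a daily write budget (assumption)."""
    capacity_tb: int          # total SSD capacity, in TB
    dwpd: int                 # assumed rated drive writes per day
    written_today_tb: int = 0  # TB written so far in the current day

    def daily_budget_tb(self) -> int:
        # Data that may be written per day without exceeding the rating.
        return self.capacity_tb * self.dwpd

    def remaining_tb(self) -> int:
        return self.daily_budget_tb() - self.written_today_tb


def place_checkpoint(bb: BurstBuffer, size_tb: int) -> str:
    """Return 'bb' if the checkpoint fits today's budget, else 'pfs'."""
    if size_tb <= bb.remaining_tb():
        bb.written_today_tb += size_tb
        return "bb"
    # Budget exhausted: fall back to the parallel file system.
    return "pfs"


def simulate_day(bb: BurstBuffer, checkpoint_tb: int,
                 interval_min: int = 60) -> list[str]:
    """Place one day's checkpoints taken every `interval_min` minutes."""
    steps = (24 * 60) // interval_min
    return [place_checkpoint(bb, checkpoint_tb) for _ in range(steps)]


if __name__ == "__main__":
    # Illustrative numbers only: 10 TB buffer, 1 DWPD, 1 TB per checkpoint.
    bb = BurstBuffer(capacity_tb=10, dwpd=1)
    placements = simulate_day(bb, checkpoint_tb=1)
    print(placements.count("bb"), "checkpoints to burst buffer,",
          placements.count("pfs"), "to parallel file system")
```

With hourly checkpoints, this greedy rule spends the whole budget early in the day and sends every later checkpoint to the slower tier. A placement that adapts to runtime conditions, as the abstract describes, would presumably spread burst buffer use more deliberately.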
