International Conference for High Performance Computing, Networking, Storage and Analysis

NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing



Abstract

Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, where high-precision values make common patterns rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from each other. Thus, capturing the distribution of relative changes in the data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order-of-magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and ChecKpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
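To make the idea concrete, the following is a minimal Python sketch of the approach the abstract describes: relative changes between two consecutive iterations are binned, each point is stored as a small bin index when the bin center approximates its change within the user-specified error bound, and points that cannot be approximated are kept verbatim as exceptions. This is an illustrative sketch only, not the authors' implementation; the function names, the equal-width histogram binning (the paper learns the change distribution, e.g. by clustering), and the parameter defaults are assumptions.

```python
# Illustrative sketch of index-based compression of relative changes.
# All names and defaults here are hypothetical, not the NUMARCK code.
import numpy as np

def compress_iteration(prev, curr, error_bound=0.01, num_bins=256):
    """Encode `curr` as bin indices over its relative changes from `prev`."""
    # Relative change of each data point between consecutive iterations.
    change = (curr - prev) / np.where(prev == 0, 1.0, prev)

    # Approximate the distribution of changes with equal-width bins
    # (a stand-in for the learned/clustered bins described in the paper).
    edges = np.linspace(change.min(), change.max(), num_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(change, edges) - 1, 0, num_bins - 1)

    # A point becomes an exception if its bin center violates the
    # per-point relative error bound.
    approx = prev * (1.0 + centers[idx])
    exact = np.abs(approx - curr) > error_bound * np.abs(curr)

    # Output: one small index per point, the bin centers, and exceptions.
    return idx.astype(np.uint8), centers, np.flatnonzero(exact), curr[exact]

def decompress_iteration(prev, idx, centers, exc_pos, exc_val):
    """Reconstruct an iteration from bin indices plus stored exceptions."""
    recon = prev * (1.0 + centers[idx])
    recon[exc_pos] = exc_val
    return recon
```

With 256 bins, each well-approximated point costs one byte instead of an eight-byte double, which is the source of the order-of-magnitude reduction the abstract claims, while the exception list preserves the guaranteed per-point error bound.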
