International Conference for High Performance Computing, Networking, Storage and Analysis

NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing



Abstract

Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, where high-precision values make common patterns rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from each other. Thus, capturing the distribution of relative changes in the data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order-of-magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and ChecKpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
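To make the idea concrete, the following is a minimal Python sketch of the approach the abstract describes: relative changes between two consecutive iterations are binned, each point is stored as a small bin index when the bin center approximates its change within the user-specified error bound, and points that cannot be approximated are kept verbatim as exceptions. This is an illustrative sketch only, not the authors' implementation; the function names, the equal-width histogram binning (the paper learns the change distribution, e.g. by clustering), and the parameter defaults are assumptions.

```python
# Illustrative sketch of index-based compression of relative changes.
# All names and defaults here are hypothetical, not the NUMARCK code.
import numpy as np

def compress_iteration(prev, curr, error_bound=0.01, num_bins=256):
    """Encode `curr` as bin indices over its relative changes from `prev`."""
    # Relative change of each data point between consecutive iterations.
    change = (curr - prev) / np.where(prev == 0, 1.0, prev)

    # Approximate the distribution of changes with equal-width bins
    # (a stand-in for the learned/clustered bins described in the paper).
    edges = np.linspace(change.min(), change.max(), num_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(change, edges) - 1, 0, num_bins - 1)

    # A point becomes an exception if its bin center violates the
    # per-point relative error bound.
    approx = prev * (1.0 + centers[idx])
    exact = np.abs(approx - curr) > error_bound * np.abs(curr)

    # Output: one small index per point, the bin centers, and exceptions.
    return idx.astype(np.uint8), centers, np.flatnonzero(exact), curr[exact]

def decompress_iteration(prev, idx, centers, exc_pos, exc_val):
    """Reconstruct an iteration from bin indices plus stored exceptions."""
    recon = prev * (1.0 + centers[idx])
    recon[exc_pos] = exc_val
    return recon
```

With 256 bins, each well-approximated point costs one byte instead of an eight-byte double, which is the source of the order-of-magnitude reduction the abstract claims, while the exception list preserves the guaranteed per-point error bound.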
