Monitoring strategies for scalable dynamic checkpointing


Abstract

Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on production systems, as well as to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience runtime.
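
The pipeline described in the abstract (listen to hardware events, measure their importance, forward the important ones to the resilience runtime so it can react to a regime change) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Event`, `Monitor`, and `Runtime` names, the per-kind importance weights, the threshold, and the two-regime model are hypothetical. The runtime's reaction uses Young's first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), which is a standard formula swapped in here; the abstract does not say how the authors' runtime adapts.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Event:
    node: str
    kind: str         # hypothetical labels, e.g. "temperature", "memory_error"
    timestamp: float  # seconds since job start


@dataclass
class Monitor:
    """Scores hardware events and forwards only the important ones."""
    weights: Dict[str, float]         # assumed importance per event kind
    threshold: float                  # forward events scoring at or above this
    forward: Callable[[Event], None]  # callback into the resilience runtime

    def observe(self, event: Event) -> bool:
        score = self.weights.get(event.kind, 0.0)  # unknown kinds score 0
        if score >= self.threshold:
            self.forward(event)
            return True
        return False


@dataclass
class Runtime:
    """Toy resilience runtime that switches between two failure regimes."""
    checkpoint_cost: float  # C: seconds needed to write one checkpoint
    normal_mtbf: float      # assumed MTBF (seconds) in the normal regime
    degraded_mtbf: float    # assumed MTBF (seconds) in the high-failure regime
    window: float           # seconds a forwarded event keeps the degraded regime
    last_event: float = -math.inf

    def notify(self, event: Event) -> None:
        self.last_event = max(self.last_event, event.timestamp)

    def interval(self, now: float) -> float:
        degraded = (now - self.last_event) <= self.window
        mtbf = self.degraded_mtbf if degraded else self.normal_mtbf
        # Young's first-order approximation of the checkpoint interval.
        return math.sqrt(2.0 * self.checkpoint_cost * mtbf)


runtime = Runtime(checkpoint_cost=50.0, normal_mtbf=10_000.0,
                  degraded_mtbf=2_500.0, window=3_600.0)
monitor = Monitor(weights={"memory_error": 0.9, "temperature": 0.2},
                  threshold=0.5, forward=runtime.notify)

filtered = monitor.observe(Event("node42", "temperature", 100.0))
before = runtime.interval(200.0)   # normal regime
forwarded = monitor.observe(Event("node42", "memory_error", 300.0))
during = runtime.interval(400.0)   # degraded regime: shorter interval
after = runtime.interval(4_000.0)  # window expired: back to normal
```

In this sketch a forwarded event shortens the checkpoint interval for a fixed window and then lets it relax again. A real deployment would need measured MTBF values per regime and a calibrated importance model, neither of which is given in the abstract.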


Bibliographic record

Source
International Green and Sustainable Computing Conference, 2016, pp. 1-8 (8 pages)
Authors
Swann Perarnau; Leonardo Bautista-Gomez
Original format: PDF
Keywords
Monitoring; Checkpointing; Runtime; Temperature sensors; Supercomputers; Temperature measurement; Hardware

