Conference: Latin-American Symposium on Dependable Computing

Running Resilient MPI Applications on a Dynamic Group of Recommended Processes



Abstract

HPC systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerance strategies for these systems assume crash faults, which are permanent events that are easily detected. This assumption does not hold in several real systems, in particular shared clusters, where even load variation may cause performance problems that are virtually equivalent to faults. In this work we present a new model to deal with this problem, in which processes execute tests among themselves to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process outside the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by the DGRP processes. We present experimental results obtained from an MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at run time to tolerate up to N - 1 faults (in a system with N processes) while sorting up to 1 billion integers.
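The membership rule described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' MPI implementation: the function names `update_dgrp` and `admit_after_consensus`, the `reports` and `votes` dictionaries, and the modelling of the consensus round as unanimous agreement are all hypothetical. It also assumes one reading of the exclusion rule, namely that a process leaves the DGRP only when every other DGRP member that tested it reported it as non-recommended.

```python
from typing import Dict, Set


def update_dgrp(dgrp: Set[int], reports: Dict[int, Dict[int, bool]]) -> Set[int]:
    """Return the next DGRP given one round of test results.

    reports[tester][target] is True when `tester` found `target`
    recommended and False when it found it non-recommended.
    Assumption: a target is dropped only if at least one DGRP member
    tested it and all members that tested it reported non-recommended.
    """
    survivors = set()
    for target in dgrp:
        testers = [
            p for p in dgrp
            if p != target and target in reports.get(p, {})
        ]
        rejected_by_all = bool(testers) and all(
            not reports[p][target] for p in testers
        )
        if not rejected_by_all:
            survivors.add(target)
    return survivors


def admit_after_consensus(
    dgrp: Set[int], candidate: int, votes: Dict[int, bool]
) -> Set[int]:
    """Toy stand-in for the consensus round that lets a process rejoin.

    votes[p] is True when member p has continuously tested `candidate`
    as recommended. Assumption: the round succeeds only if every
    current member votes True.
    """
    if all(votes.get(p, False) for p in dgrp):
        return set(dgrp) | {candidate}
    return set(dgrp)


# Four processes; process 3 runs on an overloaded core.
dgrp = {0, 1, 2, 3}
reports = {
    0: {1: True, 2: True, 3: False},
    1: {0: True, 2: True, 3: False},
    2: {0: True, 1: True, 3: False},
    3: {0: True, 1: True, 2: True},
}
dgrp = update_dgrp(dgrp, reports)  # process 3 is dropped

# Later, all remaining members agree that process 3 has recovered.
dgrp = admit_after_consensus(dgrp, 3, {0: True, 1: True, 2: True})
```

Under this reading, a group reduced to a single process has no remaining testers and therefore stays in the DGRP, which is consistent with the abstract's claim of tolerating up to N - 1 faults.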
