Published in: Journal of the Brazilian Computer Society

Running resilient MPI applications on a Dynamic Group of Recommended Processes

Abstract

High-performance computing systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerance strategies for these systems assume crash faults, which are permanent events that are easily detected. This is not the case in several real systems, particularly shared clusters, where even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem, in which processes execute tests among themselves to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process outside the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by the DGRP processes. Experimental results obtained from an MPI-based implementation are presented, in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers.
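
The membership rule stated in the abstract can be illustrated with a small, non-authoritative sketch. It is not the paper's MPI implementation: the function names, the dictionary-based test results, and the `required_streak` threshold are all assumptions introduced here for illustration. The sketch reads the rule literally, so a process leaves the group only when every other current DGRP member has tested it as non-recommended, and the consensus round that precedes a rejoin is reduced to a simple streak check.

```python
def exclude_non_recommended(dgrp, non_recommended):
    """Return the group after one round of the exclusion rule.

    dgrp: set of process ranks currently in the group.
    non_recommended: dict mapping a tester rank to the set of ranks
        it tested as non-recommended in this round (hypothetical format).
    """
    survivors = set()
    for p in dgrp:
        testers = dgrp - {p}
        # p is dropped only if it has at least one tester and
        # every tester in the current group flagged it.
        flagged_by_all = bool(testers) and all(
            p in non_recommended.get(t, set()) for t in testers
        )
        if not flagged_by_all:
            survivors.add(p)
    return survivors


def rejoin_candidates(dgrp, all_ranks, recommended_streak, required_streak):
    """Ranks outside the group whose run of consecutive 'recommended'
    results reaches required_streak (an assumed stand-in for
    'continuously tested as recommended')."""
    return {
        q for q in all_ranks - dgrp
        if recommended_streak.get(q, 0) >= required_streak
    }


if __name__ == "__main__":
    group = {0, 1, 2, 3}
    # Ranks 0, 1 and 2 all flag rank 3, so rank 3 is excluded.
    group = exclude_non_recommended(group, {0: {3}, 1: {3}, 2: {3}})
    print(sorted(group))
    # Rank 3 later accumulates three consecutive 'recommended' results.
    print(sorted(rejoin_candidates(group, {0, 1, 2, 3}, {3: 3}, 3)))
```

In this reading, a group that has shrunk to a single process never excludes itself, which is consistent with tolerating up to N − 1 faults. A real implementation would run the tests and the agreement step over MPI messages rather than on in-memory sets.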
