Conference on Computing in High Energy and Nuclear Physics

Testing SLURM open source batch system for a Tier1/Tier2 HEP computing facility

Abstract

In this work we present the testing activities carried out to verify whether the SLURM batch system could be used as the production batch system of a typical Tier1/Tier2 HEP computing center. SLURM (Simple Linux Utility for Resource Management) is an open source batch system developed mainly by the Lawrence Livermore National Laboratory, SchedMD, Linux NetworX, Hewlett-Packard, and Groupe Bull. Testing focused both on verifying the functionalities of the batch system and on the performance that SLURM is able to offer. We first describe our initial set of requirements. On the functional side, we started by configuring SLURM so that it replicated all the scheduling policies already used in production at the computing centers involved in the test, i.e. INFN-Bari and the INFN-Tier1 at CNAF, Bologna. Currently, the INFN-Tier1 is using IBM LSF (Load Sharing Facility), while INFN-Bari, an LHC Tier2 for both CMS and ALICE, is using Torque as resource manager and MAUI as scheduler. We show how we configured SLURM in order to enable several scheduling functionalities such as Hierarchical FairShare, Quality of Service, user-based and group-based priority, limits on the number of jobs per user/group/queue, job age scheduling, job size scheduling, and scheduling of consumable resources. We then show how different job typologies, such as serial, MPI, multi-thread, whole-node, and interactive jobs, can be managed. Tests on the use of ACLs on queues and, more generally, on other resources are then described. A peculiar SLURM feature we also verified is event triggers, useful to configure specific actions for each possible event in the batch system. We also tested highly available configurations for the master node. This feature is of paramount importance since a mandatory requirement in our scenarios is to have a working farm cluster even in case of hardware failure of the server(s) hosting the batch system. Among our requirements is also the possibility to run pre-execution and post-execution scripts, with controlled handling of the failure of such scripts. This feature is heavily used, for example, at the INFN-Tier1 in order to check the health status of a worker node before the execution of each job. Pre- and post-execution scripts are also important to let WNoDeS, the IaaS Cloud solution developed at INFN, use SLURM as its resource manager. WNoDeS has supported the LSF and Torque batch systems for some time; in this work we describe the work done so that WNoDeS supports SLURM as well. Finally, we present several performance tests that we carried out to verify SLURM scalability and reliability, detailing scalability tests both in terms of managed nodes and of queued jobs.
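As an illustration of the job typologies and scheduling features mentioned in the abstract, the sketch below submits serial, MPI, multi-thread, and whole-node jobs through SLURM's standard command-line tools and then inspects the fairshare tree used by the multifactor priority plugin. It is not taken from the paper: the partition, QOS, and application names are hypothetical placeholders, and the exact options available depend on the SLURM version and site configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch: submitting the job typologies discussed in the paper
(serial, MPI, multi-thread, whole-node) via SLURM's command-line tools.
Partition, QOS, and application names are hypothetical placeholders."""

import subprocess


def sbatch(*args):
    """Submit a job with sbatch and return its output, e.g. 'Submitted batch job 1234'."""
    cmd = ["sbatch", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()


# Serial job: a single task on one CPU.
print(sbatch("--job-name=serial_test", "--ntasks=1",
             "--partition=main", "--qos=normal",
             "--wrap=hostname"))

# MPI job: 64 tasks; srun inside the allocation launches the MPI ranks.
print(sbatch("--job-name=mpi_test", "--ntasks=64",
             "--partition=main", "--qos=normal",
             "--wrap=srun ./my_mpi_app"))

# Multi-thread job: one task bound to 8 CPUs on the same node.
print(sbatch("--job-name=smp_test", "--ntasks=1", "--cpus-per-task=8",
             "--wrap=./my_threaded_app"))

# Whole-node job: request exclusive use of one node.
print(sbatch("--job-name=wholenode_test", "--nodes=1", "--exclusive",
             "--wrap=./my_node_app"))

# Inspect the hierarchical fairshare tree (all accounts and users).
print(subprocess.run(["sshare", "-a"], capture_output=True, text=True).stdout)
```

Driving the jobs through the standard sbatch/sshare command-line interface keeps the sketch independent of any particular SLURM API binding; interactive jobs, which the paper also covers, would instead be started with srun (e.g. with a pseudo-terminal allocated).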
