Conference on Computing in High Energy and Nuclear Physics

Testing SLURM open source batch system for a Tier1/Tier2 HEP computing facility

Abstract

In this work we present the testing activities carried out to verify whether the SLURM batch system could be used as the production batch system of a typical Tier1/Tier2 HEP computing center. SLURM (Simple Linux Utility for Resource Management) is an open source batch system developed mainly by the Lawrence Livermore National Laboratory, SchedMD, Linux NetworX, Hewlett-Packard, and Groupe Bull. Testing focused both on verifying the functionalities of the batch system and on the performance that SLURM is able to offer. We first describe our initial set of requirements. On the functional side, we started by configuring SLURM so that it replicated all the scheduling policies already used in production at the computing centers involved in the test, i.e. INFN-Bari and the INFN-Tier1 at CNAF, Bologna. Currently, the INFN-Tier1 is using IBM LSF (Load Sharing Facility), while INFN-Bari, an LHC Tier2 for both CMS and ALICE, is using Torque as resource manager and MAUI as scheduler. We show how we configured SLURM in order to enable several scheduling functionalities such as Hierarchical FairShare, Quality of Service, user-based and group-based priority, limits on the number of jobs per user/group/queue, job age scheduling, job size scheduling, and scheduling of consumable resources. We then show how different job typologies, such as serial, MPI, multi-thread, whole-node, and interactive jobs, can be managed. Tests on the use of ACLs on queues and, more generally, on other resources are then described. A peculiar SLURM feature we also verified is event triggers, useful to configure specific actions for each possible event in the batch system. We also tested highly available configurations for the master node. This feature is of paramount importance since a mandatory requirement in our scenarios is to have a working farm cluster even in case of hardware failure of the server(s) hosting the batch system. Among our requirements is also the possibility to run pre-execution and post-execution scripts, with controlled handling of the failure of such scripts. This feature is heavily used, for example, at the INFN-Tier1 in order to check the health status of a worker node before the execution of each job. Pre- and post-execution scripts are also important to let WNoDeS, the IaaS Cloud solution developed at INFN, use SLURM as its resource manager. WNoDeS has supported the LSF and Torque batch systems for some time; in this work we describe the work done so that WNoDeS supports SLURM as well. Finally, we present several performance tests that we carried out to verify SLURM scalability and reliability, detailing scalability tests both in terms of managed nodes and of queued jobs.
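As an illustration of the job typologies and scheduling features mentioned in the abstract, the sketch below submits serial, MPI, multi-thread, and whole-node jobs through SLURM's standard command-line tools and then inspects the fairshare tree used by the multifactor priority plugin. It is not taken from the paper: the partition, QOS, and application names are hypothetical placeholders, and the exact options available depend on the SLURM version and site configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch: submitting the job typologies discussed in the paper
(serial, MPI, multi-thread, whole-node) via SLURM's command-line tools.
Partition, QOS, and application names are hypothetical placeholders."""

import subprocess


def sbatch(*args):
    """Submit a job with sbatch and return its output, e.g. 'Submitted batch job 1234'."""
    cmd = ["sbatch", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()


# Serial job: a single task on one CPU.
print(sbatch("--job-name=serial_test", "--ntasks=1",
             "--partition=main", "--qos=normal",
             "--wrap=hostname"))

# MPI job: 64 tasks; srun inside the allocation launches the MPI ranks.
print(sbatch("--job-name=mpi_test", "--ntasks=64",
             "--partition=main", "--qos=normal",
             "--wrap=srun ./my_mpi_app"))

# Multi-thread job: one task bound to 8 CPUs on the same node.
print(sbatch("--job-name=smp_test", "--ntasks=1", "--cpus-per-task=8",
             "--wrap=./my_threaded_app"))

# Whole-node job: request exclusive use of one node.
print(sbatch("--job-name=wholenode_test", "--nodes=1", "--exclusive",
             "--wrap=./my_node_app"))

# Inspect the hierarchical fairshare tree (all accounts and users).
print(subprocess.run(["sshare", "-a"], capture_output=True, text=True).stdout)
```

Driving the jobs through the standard sbatch/sshare command-line interface keeps the sketch independent of any particular SLURM API binding; interactive jobs, which the paper also covers, would instead be started with srun (e.g. with a pseudo-terminal allocated).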
