Journal: Concurrency, practice and experience

Reducing the number of response time service level objective violations by a cloud-HPC convergence scheduler

Abstract

Job scheduling is a long-standing topic in High-Performance Computing (HPC), and it is increasingly studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its own scheduler. Migrating jobs from the HPC partition to the cloud partition is widely discussed in the literature. However, job migration from cloud to HPC is much less explored. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition may be observed in Google data center workloads. Job scheduling using an overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference with HPC jobs and a low number of response time violations. We introduce the definition of a cloud-HPC convergence area and propose a job scheduling strategy for it, aiming at reducing the number of response time violations of cloud jobs without interfering with HPC job execution. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from a production HPC grid. The experimental results show that there is often a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms. One simulation scenario considered a set of 11K parallel HPC jobs running on a 2560-processor platform with an average resource usage level of 38.0%.
The proposed convergence scheduler succeeded in injecting around 267K cloud jobs into the HPC platform, with a response time violation rate under 0.00094% for such jobs, considering 80 processors in the convergence area and no effects on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered high priority in the Google cloud Service Level Agreement, SLA). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.
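
The abstract does not give the formal definition of the scheduler, so the following is only a minimal sketch of the general idea of backfilling cloud jobs into idle gaps and counting response time violations. It assumes that a job meets its objective when its response time (finish minus submission) is at most the slack factor times its processing time; that reading of "slack factor" is an assumption, not the paper's stated definition. The names `CloudJob` and `backfill` are hypothetical, the sketch models a single processor of the convergence area, and it does not use SimGrid.

```python
from dataclasses import dataclass


@dataclass
class CloudJob:
    submit: float   # submission time
    runtime: float  # processing time
    slack: float    # slack factor, e.g. 1.25 or 4

    @property
    def deadline(self) -> float:
        # Assumed objective: response time <= slack * processing time.
        return self.submit + self.slack * self.runtime


def backfill(jobs, gaps):
    """Greedily place jobs into idle (start, end) gaps on one processor.

    Returns (finish_times, violations): a dict mapping job index to finish
    time, and the indices of jobs that miss their deadline or never fit.
    """
    free = [list(g) for g in sorted(gaps)]  # mutable copies of the idle gaps
    finish_times = {}
    violations = []
    # Serve jobs in submission order (first come, first served).
    for i, job in sorted(enumerate(jobs), key=lambda p: p[1].submit):
        for gap in free:
            start = max(gap[0], job.submit)
            if start + job.runtime <= gap[1]:
                gap[0] = start + job.runtime  # consume the gap up to the job's end
                finish_times[i] = start + job.runtime
                break
        # An unplaced job counts as a violation as well.
        if finish_times.get(i, float("inf")) > job.deadline:
            violations.append(i)
    return finish_times, violations


gaps = [(0, 10), (20, 40)]          # idle intervals left by HPC jobs
jobs = [
    CloudJob(submit=0, runtime=8, slack=1.25),
    CloudJob(submit=1, runtime=8, slack=1.25),
    CloudJob(submit=2, runtime=2, slack=4.0),
]
finish_times, violations = backfill(jobs, gaps)
print(finish_times, violations)
```

In this toy input the second job has to wait for the later gap and misses its deadline, while the third job still fits in the remainder of the first gap because its larger slack factor tolerates the wait. This matches the qualitative trend reported in the abstract, where a higher slack factor leaves fewer violations.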