Performance Implications of Failures in Large-Scale Cluster Scheduling

Abstract

As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems anticipating and accommodating the occurrence of failures. Failures become a commonplace feature of such large-scale systems, and one cannot continue to treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of the performance impact of these critical issues on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters and then we exploit this framework to conduct a detailed performance analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with the former providing the greatest benefits.
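The abstract reports results in terms of mean job response time and mean job slowdown but does not define them. A minimal sketch follows, assuming the conventional definitions: response time is completion time minus arrival time, and slowdown is response time divided by the job's failure-free service time. The `Job` record and its field names are hypothetical, introduced only for illustration and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Job:
    # Hypothetical record; field names are assumptions, not from the paper.
    arrival: float     # time the job entered the system
    service: float     # failure-free execution time (must be > 0)
    completion: float  # time the job actually finished, including any
                       # queueing delay and work redone after failures


def mean_response_time(jobs: Sequence[Job]) -> float:
    """Average of (completion - arrival) over all jobs."""
    return sum(j.completion - j.arrival for j in jobs) / len(jobs)


def mean_slowdown(jobs: Sequence[Job]) -> float:
    """Average of response time divided by failure-free service time."""
    return sum((j.completion - j.arrival) / j.service for j in jobs) / len(jobs)
```

Under these assumed definitions, a job that loses work to a failure and has to redo it finishes later, which raises both metrics. Slowdown weights the delay by job size, so short jobs hit by a failure are penalized more heavily than long ones.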
