Journal: Concurrency and Computation

Network-aware selective job checkpoint and migration to enhance co-allocation in multi-cluster systems


Abstract

Multi-site parallel job schedulers can improve average job turnaround time by making use of fragmented node resources available throughout the grid. By mapping jobs across potentially many clusters, jobs that would otherwise wait in the queue for local resources can begin execution much earlier, thereby improving system utilization and reducing average queue waiting time. Recent research in this area of scheduling leverages user-provided estimates of job communication characteristics to partition the job across system resources more effectively. In this paper, we address the impact of inaccuracies in these estimates on system performance and show that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are instances where these errors result in poor job scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application performance and turnaround time. Consequently, we explore the use of job checkpointing, termination, migration, and restart (CTMR) to selectively stop offending jobs to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, CTMR improves overall performance. We demonstrate that this technique is beneficial even when its overhead is high.
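The selective-stop idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Job` fields, the function names, and the "largest excess over the user estimate first" heuristic are all assumptions made for illustration, since the abstract does not say how offending jobs are chosen.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    estimated_bw: float  # user-provided bandwidth estimate (hypothetical unit: Mb/s)
    measured_bw: float   # bandwidth the job actually consumes on the shared link


def select_jobs_to_checkpoint(jobs, link_capacity):
    """Return the jobs to checkpoint so that measured demand fits the link.

    Assumed heuristic (not from the paper): stop the jobs that exceed
    their own estimate by the largest margin first.
    """
    demand = sum(j.measured_bw for j in jobs)
    by_excess = sorted(
        jobs, key=lambda j: j.measured_bw - j.estimated_bw, reverse=True
    )
    stopped = []
    for job in by_excess:
        if demand <= link_capacity:
            break  # link is no longer over-subscribed
        stopped.append(job)
        demand -= job.measured_bw
    return stopped


def can_restart(job, free_bw):
    """A checkpointed job may restart where enough bandwidth is free."""
    return job.measured_bw <= free_bw
```

In this sketch a link is treated as over-subscribed when the sum of measured bandwidths exceeds its capacity, and a stopped job becomes eligible for restart on any link whose free bandwidth covers its measured demand. Checkpoint and migration overhead, which the paper evaluates, is not modeled here.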
