Network-aware selective job checkpoint and migration to enhance co-allocation in multi-cluster systems

William M. Jones


Abstract

Multi-site parallel job schedulers can improve average job turnaround time by making use of fragmented node resources available throughout the grid. By mapping jobs across potentially many clusters, jobs that would otherwise wait in the queue for local resources can begin execution much earlier, thereby improving system utilization and reducing average queue waiting time. Recent research in this area of scheduling leverages user-provided estimates of job communication characteristics to more effectively partition the job across system resources. In this paper, we address the impact of inaccuracies in these estimates on system performance and show that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are instances where these errors result in poor job scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application performance and turnaround time. Consequently, we explore the use of job checkpointing, termination, migration, and restart (CTMR) to selectively stop offending jobs to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, CTMR improves overall performance. We demonstrate that this technique is beneficial even when its overhead is high.
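
The abstract does not say how offending jobs are chosen for CTMR, so the following is only an illustrative sketch and not the paper's algorithm. It assumes a single shared inter-cluster link with a fixed capacity and a known per-job bandwidth demand, and it uses a heaviest-first greedy policy. The names `Job`, `bandwidth`, `link_capacity`, and `select_jobs_to_checkpoint` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    bandwidth: float  # assumed bandwidth demand on the shared link (arbitrary units)


def select_jobs_to_checkpoint(jobs, link_capacity):
    """Pick jobs to stop until the remaining demand fits the link.

    Assumption: a greedy heaviest-first policy, chosen only for illustration.
    """
    demand = sum(job.bandwidth for job in jobs)
    victims = []
    for job in sorted(jobs, key=lambda j: j.bandwidth, reverse=True):
        if demand <= link_capacity:
            break  # link is no longer over-subscribed
        victims.append(job)
        demand -= job.bandwidth
    return victims


if __name__ == "__main__":
    running = [Job("a", 40.0), Job("b", 70.0), Job("c", 20.0)]
    stopped = select_jobs_to_checkpoint(running, link_capacity=100.0)
    print([job.name for job in stopped])
```

In this sketch the selected jobs would be checkpointed and later restarted when (and where) sufficient network capacity is available. A real scheduler would also have to weigh the checkpoint and restart overhead that the abstract mentions.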


Bibliographic details

Source: Concurrency and Computation, 2009, issue 13, pp. 1672-1691 (20 pages)
Author: William M. Jones
Author affiliation: Computer Science Department, Coastal Carolina University, Conway, SC 29526, U.S.A.
Original format: PDF
Language: English
Keywords: parallel job scheduling; checkpointing; migration; clusters; grid scheduling

