【24h】

Lifeline-based Global Load Balancing

机译:基于生命线的全局负载平衡

获取原文
获取原文并翻译 | 示例

摘要

On shared-memory systems, Cilk-style work-stealing [5] has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS) [24, 28].There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asyn-chronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection.In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully-connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A qui-esced node is reactivated when work arrives from a lifeline, and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10,such passive distributed termination can be detected automatically using the finish construct - no application code is necessary.Our design is implemented in a few hundred lines of X10. On the binomial tree described in [26], the program achieve 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/sec. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.9(36 GNodes/s. All numbers are relative to single core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes.In summary, we claim: (a) the first formulation of UTS that does not involve application level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals (c) the demonstration of simple lifeline graphs based on Ar-hypercubes, (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS. In particular, our framework can deliver the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.
机译:在共享内存系统上,Cilk风格的工作窃取[5]已被用于有效地并行化基于不规则任务图的应用程序,例如不平衡树搜索(UTS)[24,28]。扩展此方法存在两个主要困难。到分布式内存。在共享内存方法中,窃贼(无工作的节点)不断尝试异步地从随机选择的受害者那里窃取工作,直到找到工作为止。在分布式内存中,小偷不能在不中断执行的情况下自动从受害者那里窃取工作。当工作稀疏时,这会导致性能下降。从本质上讲,传统的工作窃取直接扩展到分布式内存违反了工作窃取的工作优先原则。此外,盗贼会花费无用的CPU周期来攻击没有工作的受害者,从而导致多程序环境中的系统效率低下。其次,检测活动的分布式终止并非易事(检测所有节点上的程序正在寻找工作,因此没有工作)。已对此问题进行了深入研究,并且需要仔细设计才能获得良好的性能。不幸的是,在大多数现有语言/框架中,应用程序开发人员被迫实施自己的分布式终止检测。在本文中,我们提出了一套简单的想法,使工作窃取得以有效地扩展到分布式内存。首先,我们介绍生命线图:低度,低直径,完全连接的有向图。可以从k维超立方体构建此类图。如果节点在w窃取失败之后无法找到工作,则在其生命线图中通知输出边缘后,它将停止运行。静态节点不会打扰其他节点。当工作从生命线到达时,静止节点将被重新激活,并且其自身与其激活的传入生命线共享该工作。当所有节点的计算都已停止时,终止就发生了。在X10之类的语言中,可以使用finish结构自动检测到这种被动的分布式终端-无需应用程序代码。我们的设计在X10的几百行中实现。在[26]中描述的二叉树上,该程序在1024个Power7内核的Infiniband群集上实现了87%的效率,峰值吞吐量为2.37 GNodes / sec。在具有2048个处理器的Blue Gene / P上,它的效率达到87%,峰值吞吐量为0.9(36 GNodes / s。所有数字均与单核顺序性能有关。此实现已重构为可重用的全局负载平衡框架。应用程序可以使用此框架以最少的代码更改获得全局负载平衡。总而言之,我们主张:(a)不涉及应用程序级别全局终止检测的UTS的第一种表述;(b)引入生命线图以减少失败窃取(c)基于Ar-超立方体的简单生命线图的演示,(d)具有比UTS上已发表的结果更高的效率(或相同的效率,但范围更广)的性能,特别是我们的框架可以提供相同或相同的作为不受限制的随机工作窃取实现的更好性能,同时减少了尝试窃取的次数。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号