Lifeline-based Global Load Balancing

机译：基于生命线的全局负载平衡

获取原文

获取原文并翻译 | 示例

页面导航

摘要
著录项
相似文献
相关主题

摘要

On shared-memory systems, Cilk-style work-stealing [5] has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS) [24, 28].There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asyn-chronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection.In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully-connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A qui-esced node is reactivated when work arrives from a lifeline, and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10,such passive distributed termination can be detected automatically using the finish construct - no application code is necessary.Our design is implemented in a few hundred lines of X10. On the binomial tree described in [26], the program achieve 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/sec. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.9(36 GNodes/s. All numbers are relative to single core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes.In summary, we claim: (a) the first formulation of UTS that does not involve application level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals (c) the demonstration of simple lifeline graphs based on Ar-hypercubes, (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS. In particular, our framework can deliver the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.

机译：在共享内存系统上，Cilk风格的工作窃取[5]已被用于有效地并行化基于不规则任务图的应用程序，例如不平衡树搜索（UTS）[24，28]。扩展此方法存在两个主要困难。到分布式内存。在共享内存方法中，窃贼（无工作的节点）不断尝试异步地从随机选择的受害者那里窃取工作，直到找到工作为止。在分布式内存中，小偷不能在不中断执行的情况下自动从受害者那里窃取工作。当工作稀疏时，这会导致性能下降。从本质上讲，传统的工作窃取直接扩展到分布式内存违反了工作窃取的工作优先原则。此外，盗贼会花费无用的CPU周期来攻击没有工作的受害者，从而导致多程序环境中的系统效率低下。其次，检测活动的分布式终止并非易事（检测所有节点上的程序正在寻找工作，因此没有工作）。已对此问题进行了深入研究，并且需要仔细设计才能获得良好的性能。不幸的是，在大多数现有语言/框架中，应用程序开发人员被迫实施自己的分布式终止检测。在本文中，我们提出了一套简单的想法，使工作窃取得以有效地扩展到分布式内存。首先，我们介绍生命线图：低度，低直径，完全连接的有向图。可以从k维超立方体构建此类图。如果节点在w窃取失败之后无法找到工作，则在其生命线图中通知输出边缘后，它将停止运行。静态节点不会打扰其他节点。当工作从生命线到达时，静止节点将被重新激活，并且其自身与其激活的传入生命线共享该工作。当所有节点的计算都已停止时，终止就发生了。在X10之类的语言中，可以使用finish结构自动检测到这种被动的分布式终端-无需应用程序代码。我们的设计在X10的几百行中实现。在[26]中描述的二叉树上，该程序在1024个Power7内核的Infiniband群集上实现了87％的效率，峰值吞吐量为2.37 GNodes / sec。在具有2048个处理器的Blue Gene / P上，它的效率达到87％，峰值吞吐量为0.9（36 GNodes / s。所有数字均与单核顺序性能有关。此实现已重构为可重用的全局负载平衡框架。应用程序可以使用此框架以最少的代码更改获得全局负载平衡。总而言之，我们主张：（a）不涉及应用程序级别全局终止检测的UTS的第一种表述；（b）引入生命线图以减少失败窃取（c）基于Ar-超立方体的简单生命线图的演示，（d）具有比UTS上已发表的结果更高的效率（或相同的效率，但范围更广）的性能，特别是我们的框架可以提供相同或相同的作为不受限制的随机工作窃取实现的更好性能，同时减少了尝试窃取的次数。

著录项

来源
《Proceedings of the 2011 ACM SIGPLAN symposium on principles and practice of parallel programming.》|2011年|p.201-211|共11页
会议地点 San Antonio TX(US);San Antonio TX(US)
作者
Vijay Saraswat; Prabhanjan Kambadur; Sreedhar Kodali; David Grove; Sriram Krishnamoorthy;
展开▼
作者单位

IBM TJ Watson Research Center;

IBM TJ Watson Research Center;

IBM Systems and Technology Group;

IBM TJ Watson Research Center;

Pacific Northwest National Laboratory;

展开▼
会议组织
原文格式 PDF
正文语种 eng
中图分类计算机软件;计算机软件;
关键词
UTS; global load balancing; distributed computing; X10; work-stealing; parallel programming;

机译：悉尼科技大学;全局负载平衡；分布式计算X10;工作窃取；并行编程;

相似文献

外文文献
中文文献
专利

1. Fault Tolerance for Lifeline-Based Global Load Balancing [J] . Claudia Fohry, Marco Bungart, Paul Plock Journal of Software Engineering and Applications . 2017,第13期

机译：基于生命线的全局负载平衡的容错
2. Lifeline-based Global Load Balancing [J] . Vijay Saraswat, Prabhanjan Kambadur, Sreedhar Kodali, ACM SIGPLAN Notices: A Monthly Publication of the Special Interest Group on Programming Languages . 2011,第8期

机译：基于生命线的全局负载平衡
3. A study on a advanced load balancing method - sharing of a load balancing device with multiple customers, and balancing of requests for streaming contents based on user experienced QoS [J] . Jumpei Watase, Takahiro Tamura, Kiyoshi Yanagimoto, 電子情報通信学会技術研究報告. 交換システム . 2001,第715期

机译：先进的负载平衡方法的研究-与多个客户共享负载平衡设备，并根据用户体验的QoS平衡流内容的请求
4. Fault Tolerance for Cooperative Lifeline-Based Global Load Balancing in Java with APGAS and Hazelcast [C] . Jonas Posner, Claudia Fohry IEEE International Parallel and Distributed Processing Symposium Workshops . 2017

机译：带有APGAS和Hazelcast的Java中基于协作生命线的全局负载平衡的容错
5. Development and implementation of the Multi-resolution Assignment and Loading of Transportation Activities (MALTA) simulation based dynamic traffic assignment system, Recursive On-line Load Balance framework (ROLB) [D] . Villalobos Prats, Jorge Alejandro 2011

机译：基于动态交通分配系统，递归在线负载平衡框架（ROLB）的多分辨率运输活动分配和装载（MALTA）模拟的开发和实现
6. The Impact of an External Load of Football Equipment on Dynamic Balance as Assessed by the Modified Star Excursion Balance Test [O] . THOMAS DENEHEY, TIMOTHY MARSHALL, KIM SPACCAROTELLA, -1

机译：修正的星际偏移平衡测试评估足球装备的外部负载对动平衡的影响
7. GLB: Lifeline-based Global Load Balancing library in X10 [O] . Zhang, Wei, Tardieu, Olivier, Grove, David, 2013

机译：GLB：X10中基于生命线的全局负载平衡库

Lifeline-based Global Load Balancing

摘要

著录项

相似文献

相关主题

期刊订阅