Conference: 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops

ProSteal: A Proactive Work Stealer for Bulk Synchronous Tasks Distributed on a Cluster of Heterogeneous Machines with Multiple Accelerators



Abstract

Work stealing is an effective load-balancing technique in shared-memory parallel programming. In a distributed setup, however, researchers have pointed out difficulties in termination detection and in sustaining a healthy steal success rate. Keeping unsuccessful steal attempts to a minimum is especially important with many-core accelerators (which have specialized engines for data copy-in and copy-out), as this ensures not only that the accelerators (or GPUs) are busy but also that their copy engines are working in parallel. A steal attempt by a GPU may dry up one or more stages in this pipeline of copy and execution engines. In a cluster environment, a similar problem arises with the pipeline that overlaps remote data transfers with local computations. In this paper, we study the loss in compute-communication overlap as a result of work stealing. We also present a proactive stealing approach that regains the lost overlap at the stealer's end. We evaluate our technique on Unicorn, a framework that decomposes bulk synchronous computations over a cluster of nodes equipped with multiple CPUs and GPUs. Compared to conventional random victim selection with a half-steal strategy, our approach achieves a performance gain of 3.19x while convolving a 4 GB image with a 31×31 filter and 1.34x while multiplying two square matrices of one billion elements each over a 10-node cluster with 120 CPUs and 20 GPUs.
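
For readers unfamiliar with the baseline named in the abstract, the following is a minimal single-process sketch of conventional random victim selection with a half-steal strategy. It is not the ProSteal algorithm and not Unicorn code. The function names `steal_half` and `run`, the round-robin scheduling loop, and the task counts are all assumptions made for illustration, and the model ignores data transfers and the copy/execute pipeline entirely. It only shows how steal attempts can fail when the randomly chosen victim has little or no work left.

```python
import random
from collections import deque


def steal_half(queues, thief, rng):
    """Thief picks a random victim and takes half of its pending tasks.

    Returns the number of tasks stolen (0 means an unsuccessful attempt).
    """
    victims = [i for i in range(len(queues)) if i != thief]
    victim = rng.choice(victims)
    count = len(queues[victim]) // 2
    # The victim consumes from the left end, so the thief takes from the right.
    for _ in range(count):
        queues[thief].append(queues[victim].pop())
    return count


def run(task_counts, seed=0):
    """Simulate workers until every task has run.

    Returns (tasks_executed, unsuccessful_steal_attempts).
    """
    rng = random.Random(seed)
    queues = [deque(range(n)) for n in task_counts]
    executed = failed = 0
    while any(queues):
        for worker, q in enumerate(queues):
            if q:
                q.popleft()  # execute one local task
                executed += 1
            elif steal_half(queues, worker, rng) == 0:
                failed += 1  # the chosen victim had nothing to share
    return executed, failed


if __name__ == "__main__":
    done, misses = run([64, 0, 0, 0])
    print(f"executed={done} unsuccessful_steals={misses}")
```

In this toy model an unsuccessful attempt costs nothing. In the setting the abstract describes, each miss can leave a stage of the copy and execution pipeline idle, which is why the steal success rate matters there.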
