24th ACM International Conference on Supercomputing (ICS 2010)

Optimal Bucket Algorithms for Large MPI Collectives on Torus Interconnects



Abstract

Collectives are an important and frequently used component of MPI. Bucket algorithms, also known as "large vector" algorithms, were introduced in the early 1990s and have since evolved into a well-known paradigm for large MPI collectives. Many modern supercomputers, such as the IBM Blue Gene and Cray XT, are based on torus interconnects, which offer a highly scalable interconnection architecture for distributed memory systems. While near-optimal algorithms have been developed for torus interconnects in other paradigms, such as spanning trees, bucket algorithms have not been optimally extended to these networks. In this paper, we study the basic "divide, distribute and gather" MPI collectives for bucket algorithms - Allgather, Reduce-scatter and Allreduce - for large messages on torus interconnects.

We present bucket-based algorithms for these collectives on bidirectional links. We show that these algorithms are optimal in terms of bandwidth and computation for symmetric torus networks (i.e. when all the dimensions are equal), matching the theoretical lower bounds. For an asymmetric torus, our algorithms are asymptotically optimal and converge to the lower bound for large dimension sizes. We also argue that our bucket algorithms are more scalable on multi-cores in comparison to spanning tree algorithms. Previous studies of bucket algorithms on torus interconnects have focused on unidirectional links and have been unable to obtain tight lower bounds and optimal algorithms. We close this gap by providing stronger lower bounds and showing that our bidirectional algorithms can easily be adapted to the unidirectional case, matching our lower bounds in terms of bandwidth and computational complexity.

We implement our algorithms on the IBM Blue Gene/P supercomputer, which has quad-core nodes connected in a 3-dimensional torus, using the low-level communication interface. We demonstrate that our algorithms perform within 7-30% of the lower bounds for different MPI collectives. We demonstrate good scaling using multicores. We also demonstrate a factor of 3 to 17 speedup for various collectives in comparison to the latest optimized MPI implementation.
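The building block behind these collectives is the bucket (ring) exchange: the vector is split into p chunks and, in p-1 steps, each process forwards to its neighbor the chunk it received in the previous step, so every link carries (p-1)/p of the data and the schedule is bandwidth-optimal on a ring. Below is a minimal sketch of this one-dimensional bucket Allgather, which the paper generalizes to bidirectional multi-dimensional tori; the function name, chunk layout, and use of MPI_Sendrecv are illustrative assumptions, not the paper's Blue Gene/P implementation.

/*
 * Minimal sketch of a one-dimensional bucket (ring) Allgather, the
 * classic "large vector" pattern.  Illustrative only: this is not the
 * paper's low-level Blue Gene/P implementation.
 */
#include <mpi.h>

/* Each rank contributes `count` doubles located at recvbuf[rank*count];
 * after p-1 ring steps every rank holds all p chunks. */
static void bucket_allgather(double *recvbuf, int count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;       /* send direction around the ring */
    int left  = (rank - 1 + p) % p;   /* receive direction              */

    for (int step = 0; step < p - 1; step++) {
        /* Forward the chunk received in the previous step
         * (initially our own contribution). */
        int send_idx = (rank - step + p) % p;
        int recv_idx = (rank - step - 1 + p) % p;

        MPI_Sendrecv(recvbuf + (size_t)send_idx * count, count, MPI_DOUBLE,
                     right, 0,
                     recvbuf + (size_t)recv_idx * count, count, MPI_DOUBLE,
                     left, 0, comm, MPI_STATUS_IGNORE);
    }
}

Roughly speaking, the paper's torus algorithms partition the message across dimensions and link directions so that ring exchanges of this kind run concurrently over all links, which is the source of the bandwidth optimality claimed for symmetric tori.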
