International Conference on Field-Programmable Technology

A Reconfigurable Compute-in-the-Network FPGA Assistant for High-Level Collective Support with Distributed Matrix Multiply Case Study

Abstract

Collectives are a fundamental part of HPC applications, and their optimization has undergone decades of study. In recent years, collectives have been accelerated with in-network hardware support, initially in the NIC but more recently also in the switch. This support is limited, however, to a very small set of scalar operations. In this work, we first propose that these collectives be extended to operations on composite data types such as matrices. We then demonstrate how these high-level collectives can be supported in an FPGA-based switch. To that end, we propose FPin, a reconfigurable compute-in-the-network FPGA assistant that implements high-level collectives in MPI. To maintain streaming packet processing while retaining reuse-based, compute-intensive processing, we propose a bulk-streaming message passing interface along with a methodology to tune communication-computation overlap. As a proof of concept, we evaluate the efficiency of the FPGA assistant with the ubiquitous distributed matrix multiply kernel, PGEMM. Experimental results show that PGEMM accelerated with high-level collective support achieves, on average, 2.4× and 1.8× speedups on an FPGA cluster over the state-of-the-art COSMA algorithm on Stampede2 Skylake for float and complex float data types, respectively.
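
To make the idea of a collective on composite data concrete, the following is a minimal software sketch, not the FPin design or its interface. It assumes a simple decomposition in which each rank holds one column block of A and the matching row block of B, and a hypothetical function `switch_matmul_reduce` stands in for the switch: it multiplies each rank's pair of blocks and accumulates the products as they arrive. All names are illustrative and do not come from the paper.

```python
def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def split_cols(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    w = len(m[0]) // parts
    return [[row[p * w:(p + 1) * w] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    h = len(m) // parts
    return [m[p * h:(p + 1) * h] for p in range(parts)]


def switch_matmul_reduce(a_blocks, b_blocks):
    """Hypothetical matrix-level collective (an assumption, not the paper's API).

    Each (a_k, b_k) pair models the blocks one rank sends to the switch.
    The switch multiplies each pair and sums the partial products, so the
    reduction operand is a whole matrix block rather than a scalar.
    """
    acc = None
    for a_k, b_k in zip(a_blocks, b_blocks):
        partial = matmul(a_k, b_k)
        acc = partial if acc is None else mat_add(acc, partial)
    return acc


RANKS = 2
A = [[i + j for j in range(4)] for i in range(2)]      # 2 x 4
B = [[i * j + 1 for j in range(3)] for i in range(4)]  # 4 x 3

# The reduced result equals the full product A @ B.
C = switch_matmul_reduce(split_cols(A, RANKS), split_rows(B, RANKS))
```

In this sketch the blocks are processed one after another; the communication-computation overlap that the abstract describes is not modeled.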