Analysis of Blocking and Scheduling for FPGA-Based Floating-Point Matrix Multiplication

Ahmad Khayyat; Naraig Manjikian

Abstract

This paper considers blocking and scheduling for the design and implementation of field-programmable gate array (FPGA)-based floating-point parallel matrix multiplication in the presence of a memory hierarchy. For high performance, on-chip memory holds data that are reused when the computation is divided into blocks, and multiple arithmetic units perform independent operations within each block in parallel. The first contribution of this paper is a detailed analysis of the design space to characterize performance based on the amount of on-chip memory used and the approaches considered for blocking and scheduling of the computation. A comparison is also made to prior work with a unified view. The second contribution is a flexible high-performance implementation for the Altera Stratix IV EP4SGX530C2 FPGA with an interface to external double-data-rate (DDR2) synchronous dynamic RAM (SDRAM). Various configuration options support optimization of different objectives, and the resulting configurations have been verified in simulation and in hardware. For double-precision floating-point arithmetic, a performance of 16 giga floating-point operations per second (GFLOPS) is achievable with 64 arithmetic units at 160 MHz.
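The blocking idea described in the abstract can be illustrated in software. The following is a minimal sketch in Python, not the authors' FPGA design: the function names `blocked_matmul` and `naive_matmul`, the `block` parameter, and the loop order are all assumptions chosen for illustration. It shows how dividing the computation into blocks keeps the inner loops working on a small tile of data (standing in for on-chip memory), and how the entries of a result tile are independent of one another.

```python
def blocked_matmul(a, b, block):
    """Multiply two n x n matrices (lists of lists) one tile at a time.

    `block` is a hypothetical tile size; in a hardware setting it would be
    bounded by the amount of on-chip memory available.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # The three outer loops select a tile of C and the tiles of A and B
    # that contribute to it.
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # The inner loops touch only data inside the current tiles,
                # so each loaded element is reused up to `block` times.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, n)):
                        # Each (i, j) entry is independent of the others in
                        # the tile, so these updates could run in parallel.
                        acc = c[i][j]
                        for k in range(k0, min(k0 + block, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


def naive_matmul(a, b):
    """Reference triple-loop product used to check the blocked version."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    n = 6
    # Small integer-valued entries keep the floating-point sums exact.
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i * j % 5) for j in range(n)] for i in range(n)]
    print(blocked_matmul(a, b, block=4) == naive_matmul(a, b))
```

The tile size of 4 in the usage example deliberately does not divide the matrix size of 6, to exercise the boundary handling. How tile sizes and the order of operations within a tile are actually chosen is the subject of the paper's design-space analysis and is not reproduced here.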


Bibliographic details

Source
Canadian Journal of Electrical and Computer Engineering, 2014, no. 2, pp. 65-75 (11 pages)
Authors
Ahmad Khayyat; Naraig Manjikian
Author affiliations

Department of Computer Engineering, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia

Department of Electrical and Computer Engineering, Queen's University, Kingston, ON K7L 3N6, Canada

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Accelerator architectures; floating-point arithmetic; matrices; parallel architectures; reconfigurable logic
Date added to database: 2022-08-18 00:53:22

