FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Yao Chen; Swathi T. Gurumani; Yun Liang; Guofeng Li; Donghui Guo; Kyle Rupnow; Deming Chen


Abstract

High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it can reduce wire routing congestion, and has higher operating frequencies and better scalability for adding new nodes. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a VC709 Xilinx evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient and we improve the system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single hardware core implementation.
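To make the data-sharing idea in the abstract concrete, here is a minimal toy model in Python. It is not the paper's implementation: the `Directory` class, the `simulate` function, the block granularity, and the access pattern (every core reads the same input blocks once) are all assumptions made for illustration. It only shows why a directory that tracks which core already holds a block can replace repeated off-chip reads with on-chip transfers.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy directory (hypothetical): maps a block ID to the core holding an on-chip copy."""
    owner: dict = field(default_factory=dict)
    external_reads: int = 0
    noc_transfers: int = 0

    def read(self, core_id: int, block_id: int) -> str:
        holder = self.owner.get(block_id)
        if holder is None:
            # First touch: fetch from off-chip memory and record the holder.
            self.owner[block_id] = core_id
            self.external_reads += 1
            return "external"
        if holder == core_id:
            # The requesting core already holds the block.
            return "local"
        # Another core holds the block: forward it over the on-chip network.
        self.noc_transfers += 1
        return "noc"


def simulate(num_cores: int, shared_blocks: int):
    """Assumed workload: every core reads the same `shared_blocks` blocks once."""
    directory = Directory()
    for core in range(num_cores):
        for block in range(shared_blocks):
            directory.read(core, block)
    baseline = num_cores * shared_blocks  # without sharing, every read goes off-chip
    return baseline, directory.external_reads, directory.noc_transfers


baseline, with_directory, noc = simulate(num_cores=4, shared_blocks=8)
print(baseline, with_directory, noc)  # prints "32 8 24"
```

In this toy workload the external reads drop from 32 to 8, with the remaining 24 requests served on-chip. These numbers come from the assumed access pattern only. The baseline here is multiple cores without sharing, whereas the abstract reports its figures against a single hardware core, so the two are not comparable.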


Bibliographic details

Source
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016, Issue 6, pp. 2220-2233 (14 pages)
Authors
Yao Chen; Swathi T. Gurumani; Yun Liang; Guofeng Li; Donghui Guo; Kyle Rupnow; Deming Chen
Author affiliation

College of Electronic Information and Optical Engineering, Nankai University, Tianjin, China

Original format: PDF
Text language: English
Keywords
CUDA; high-level synthesis (HLS); network-on-chip (NoC); parallel languages

