Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Abstract

High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it can reduce wire routing congestion, and has higher operating frequencies and better scalability for adding new nodes. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a VC709 Xilinx evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient and we improve the system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single hardware core implementation.
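The abstract does not describe how the directory-based data-sharing mechanism works internally, so the following is only a toy model of the general idea, not the authors' design. It assumes that every core reads every input block (the "heavily shared input data" case), that a directory records which core first fetched a block, and that later requests for that block are served from the holding core over the NoC rather than from off-chip memory. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(num_cores: int, num_blocks: int, use_directory: bool) -> dict:
    """Count off-chip reads and on-chip transfers in a toy sharing model.

    Assumption (not from the paper): each core requests every block once,
    and the directory maps a block id to the first core that fetched it.
    """
    directory = {}  # block id -> id of the core holding a copy
    stats = {"external_reads": 0, "noc_transfers": 0}

    for core in range(num_cores):
        for block in range(num_blocks):
            # Without a directory, no core can learn that a peer holds the block.
            holder = directory.get(block) if use_directory else None
            if holder is None:
                # Nobody holds the block yet: fetch it from off-chip memory.
                stats["external_reads"] += 1
                if use_directory:
                    directory[block] = core
            else:
                # A peer already holds the block: serve it over the NoC.
                stats["noc_transfers"] += 1
    return stats


if __name__ == "__main__":
    print(simulate(4, 8, use_directory=False))
    print(simulate(4, 8, use_directory=True))
```

In this model, four cores sharing eight blocks issue 32 off-chip reads without the directory and 8 with it. These figures only illustrate why sharing lowers off-chip traffic; they are unrelated to the 81% reduction the paper reports from its benchmarks.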
