Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT

Tili Ilian; Ovtcharov Kalin; Steffan J. Gregory

Abstract

By using resource-sharing field-programmable gate array (FPGA) compute engines, we can reduce the performance gap between soft scalar CPUs and resource-intensive custom datapath designs. This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable FPGA-based horizontally microcoded compute engine designed to highly utilize floating point (FP) functional units (FUs), can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU (similar to an FP-extended Nios). For eight benchmark applications, we show that: (i) a base TILT configuration having a single instance for each FU type can improve the performance over a soft scalar CPU by 15.8x, while requiring on average 26% of the custom datapaths' area; (ii) selectively increasing the number of FUs can more than double TILT's average throughput, reducing the custom-datapath-throughput-gap from 576x to 14x; and (iii) replicated instances of the most computationally dense TILT configuration that fit within the area of each custom datapath design can reduce the gap to 8.27x, while replicated instances of application-tuned configurations of TILT can reduce the custom-datapath-throughput-gap to an average of 5.22x, and up to 3.41x for the Matrix Multiply benchmark. Last, we present methods for design space reduction, and we correctly predict the computationally densest design for seven out of eight benchmarks.

Bibliographic details

Source
ACM Transactions on Reconfigurable Technology and Systems | 2017, Issue 3 | pp. 22.1-22.23 | 23 pages
Authors
Tili Ilian; Ovtcharov Kalin; Steffan J. Gregory
Author affiliations

Univ Toronto, Edward S Rogers Sr Dept Elect & Comp Engn, 10 Kings Coll Rd, Toronto, ON M5S 3G4, Canada;

Univ Toronto, Edward S Rogers Sr Dept Elect & Comp Engn, 10 Kings Coll Rd, Toronto, ON M5S 3G4, Canada;

Univ Toronto, Edward S Rogers Sr Dept Elect & Comp Engn, 10 Kings Coll Rd, Toronto, ON M5S 3G4, Canada;

Original format: PDF
Language: English
Keywords
Soft processors; FPGA; scheduling; compiling; throughput; computational density; design space; computer architecture

