Published in: IEEE Annual International Symposium on Field-Programmable Custom Computing Machines

Clockwork: Resource-Efficient Static Scheduling for Multi-Rate Image Processing Applications on FPGAs



Abstract

Image processing applications can benefit tremendously from FPGA acceleration. However, hardware accelerators for these applications look very different from the programs that image processing algorithm designers are accustomed to writing. As a result, many image processing hardware compilers have been designed to generate hardware accelerators from high-level specifications of image processing algorithms. Unfortunately, all of these compilers either exclude crucial access patterns, do not scale to realistically sized applications, or rely on a compilation process in which each stage of the application is an independently scheduled module that sends data to its consumers through FIFOs, which adds resource and energy overhead while inhibiting synthesis optimizations.

In this paper we present a new algorithm for compiling image processing applications, Clockwork, that uses a combination of techniques from polyhedral analysis and synchronous dataflow (SDF) to overcome these limitations. Clockwork compiles the entire application into one flat, statically scheduled module. As a result, accelerators produced by Clockwork have fixed latency, cannot deadlock, and have no resource overhead from inter-stage FIFOs. We show that designs generated by Clockwork achieve on average a 55% reduction in LUTs, a 30% reduction in flip-flops, and a 22% reduction in BRAMs compared to a state-of-the-art stencil compiler at the same throughput, while handling a wider range of access patterns. Clockwork scales to applications with more than 100,000 LUTs. For an application with dozens of stages, Clockwork achieves energy efficiency 260x that of an 8-thread Intel CPU, 17x that of an NVIDIA K80 GPU, and 2.4x that of an NVIDIA V100 GPU.
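To make the synchronous dataflow ingredient concrete, the sketch below solves the textbook SDF balance equations for a small multi-rate pipeline. It is not Clockwork's algorithm, which the abstract does not spell out; the stage names (`blur`, `downsample`, `sharpen`), the token rates, and the function `repetition_vector` are all hypothetical and chosen only for illustration. The idea it shows is that, when every stage's production and consumption rates are known at compile time, the relative firing counts of the stages can be computed ahead of time, which is what makes a static schedule possible in the first place.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def repetition_vector(actors, edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (producer, consumer, tokens_produced, tokens_consumed),
    where the token counts are per firing. Returns the smallest positive
    integer firing counts such that every edge is balanced:
        firings[producer] * tokens_produced == firings[consumer] * tokens_consumed
    """
    # Fix the first actor's rate at 1 and propagate along the edges.
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True

    if len(rate) != len(actors):
        raise ValueError("graph is not connected")

    # Every edge must balance, otherwise no bounded static schedule exists.
    for src, dst, produced, consumed in edges:
        if rate[src] * produced != rate[dst] * consumed:
            raise ValueError(f"inconsistent rates on edge {src} -> {dst}")

    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


# Hypothetical pipeline: blur emits 1 pixel per firing, downsample consumes
# 2 pixels and emits 1, sharpen consumes 1 pixel per firing.
stages = ["blur", "downsample", "sharpen"]
links = [("blur", "downsample", 1, 2), ("downsample", "sharpen", 1, 1)]
firings = repetition_vector(stages, links)
print(firings)
```

In this hypothetical pipeline `blur` must fire twice for every firing of `downsample` and `sharpen`. A static scheduler can use such fixed ratios to assign each firing a clock cycle at compile time instead of relying on run-time handshaking through FIFOs.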
