Conference: IEEE Annual International Symposium on Field-Programmable Custom Computing Machines

A High-Level Synthesis Approach Optimizing Accumulations in Floating-Point Programs Using Custom Formats and Operators


Abstract

Many case studies have demonstrated the potential of Field-Programmable Gate Arrays (FPGAs) as accelerators for a wide range of applications. FPGAs offer massive parallelism and programmability at the bit level. This enables programmers to exploit a range of techniques that avoid many bottlenecks of classical von Neumann computing. However, development costs for FPGAs are orders of magnitude higher than classical programming. A solution would be the use of High-Level Synthesis (HLS) tools, which use C as a hardware description language. However, the C language was designed to be executed on general purpose processors, not to generate hardware. Its datatypes and operators are limited to a small number (more or less matching the hardware operators present in mainstream processors), and HLS tools inherit these limitations. To better exploit the freedom offered by hardware and FPGAs, HLS vendors have enriched the C language with integer and fixed-point types of arbitrary size. Still, the operations on these types remain limited to the basic arithmetic and logic ones. In floating point, the current situation is even worse. The operator set is limited, and the sizes are restricted to 32 and 64 bits. Besides, most recent compilers, including the HLS ones, attempt to follow established standards, in particular C11 and IEEE-754. This ensures bit-exact compatibility with software, but greatly reduces the freedom of optimization by the compiler. For instance, a floating point addition is not associative even though its real equivalent is. In the present work we attempt to give the compiler more freedom. For this, we sacrifice the strict respect of the IEEE-754 and C11 standards, but we replace it with the strict respect of a high-level accuracy specification expressed by the programmer through a pragma. The case study in this work is a program transformation that applies to floating-point additions on a loop's critical path. 
It decomposes them into elementary steps, resizes the corresponding subcomponents to guarantee some user-specified accuracy, and merges and reorders these components to improve performance. The result of this complex sequence of optimizations could not be obtained from an operator generator, since it involves global loop information. For this purpose, we used a compilation flow involving one or several source-to-source transformations operating on the code given to HLS tools (Figure 1). The proposed transformation already works very well on 3 of the 10 FPMarks, where it improves both latency and accuracy by an order of magnitude for comparable area. For 2 more benchmarks, the latency is not improved (but not degraded either) due to current limitations of HLS tools. This defines short-term future work. The main result of this work is that HLS tools also have the potential to generate efficient designs for handling floating-point computations in a completely non-standard way. In the longer term, we believe that HLS flows can not only import application-specific operators from the FPGA literature, but can also improve them using high-level, program-level information.
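
The abstract's remark that floating-point addition is not associative, and the idea of replacing an accumulation with a resized custom operator, can be illustrated with a small C sketch. This is not the paper's transformation: the names `sum_float`, `sum_fixed`, `FRAC_BITS` and `INPUT_LIMIT` are hypothetical, and the 64-bit fixed-point accumulator is only an assumed stand-in for an accumulator whose width would be derived from the user's accuracy specification.

```c
#include <assert.h>
#include <stdint.h>

/* Plain binary32 accumulation: every += rounds to float, so the
 * result depends on the order in which the terms arrive. */
static float sum_float(const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Hypothetical parameters (assumptions, not taken from the paper):
 * 30 fractional bits, and inputs bounded by 2^32 in magnitude so that
 * a modest number of terms cannot overflow the 64-bit accumulator. */
#define FRAC_BITS 30
#define INPUT_LIMIT 4294967296.0 /* 2^32 */

/* Fixed-point accumulation: each input is scaled by 2^FRAC_BITS and
 * truncated to an integer once; the integer additions that follow are
 * exact and associative, so the order of the terms no longer matters. */
static double sum_fixed(const float *x, int n) {
    const double scale = (double)((int64_t)1 << FRAC_BITS);
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        double v = (double)x[i];
        assert(v < INPUT_LIMIT && v > -INPUT_LIMIT); /* range guard */
        acc += (int64_t)(v * scale); /* truncates toward zero */
    }
    return (double)acc / scale;
}
```

With the inputs `{1.0e8f, 1.0f, -1.0e8f}` and the reordered `{1.0e8f, -1.0e8f, 1.0f}`, `sum_float` gives different answers under IEEE-754 binary32 evaluation, because `1.0f` is below the rounding granularity of `1.0e8f`, whereas `sum_fixed` returns `1.0` for both orders. The fixed-point loop also carries only an integer addition on its critical path, which is the kind of freedom the abstract argues a compiler gains once bit-exact IEEE-754 behaviour is traded for an explicit accuracy target.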
