Methods for Reducing Floating-Point Computation Overhead

Abstract

Although floating-point (FP) is the most commonly used representation for real numbers, some architectures are still limited to fixed-point arithmetic because FP hardware requires substantial area and power. When FP calculations must be performed on a platform with a fixed-point datapath, a software library that emulates FP functions is typically used. Software implementations of FP operations require no additional area but suffer from low throughput. Conversely, hardware FP implementations provide high throughput but require a large amount of additional area and consequently increase leakage. It is therefore desirable to increase the FP throughput of a software implementation without incurring the area overhead of a full hardware floating-point unit (FPU). Furthermore, the widths of data words in digital processors directly affect area in application-specific ICs (ASICs) and field-programmable gate arrays (FPGAs), and circuit area in turn affects energy dissipation per workload and chip cost. Graphics and image processing workloads are very FP intensive; however, little work has explored modifying FP word width and observing its effect on image quality and chip area.

This dissertation first presents hybrid FP implementations, which improve software FP performance without incurring the area overhead of full hardware FPUs. The proposed implementations are synthesized in 65 nm complementary metal oxide semiconductor (CMOS) technology and integrated into small fixed-point processors that use a reduced instruction set computing (RISC)-like architecture. Unsigned, shift-carry, and leading zero detection (USL) support is added to the processors to augment the existing instruction set architecture (ISA) and increase FP throughput with little area overhead. Two variations of hybrid implementations are created. USL support is additional general-purpose hardware that is not specific to FP workloads (e.g., unsigned operation support), whereas custom FP-specific (CFP) hardware is dedicated to FP workload acceleration (e.g., exponent calculation logic). The first type, hybrid implementations with USL support, increases software FP throughput per core by 2.18x for addition/subtraction, 1.29x for multiplication, 3.07–4.05x for division, and 3.11–3.81x for square root, and uses 90.7–94.6% less area than dedicated fused multiply-add (FMA) hardware. The second type, hybrid implementations with CFP hardware, increases throughput per core over a fixed-point software kernel by 3.69–7.28x for addition/subtraction, 1.22–2.03x for multiplication, 14.4x for division, and 31.9x for square root, and uses 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented that improve throughput per core versus a fixed-point software implementation by 1.11–15.9x and use 38.2–95.3% less area than dedicated FMA hardware.

In addition to proposing hybrid FP implementations, this dissertation investigates the effects of modifying FP word width. For this second portion of the dissertation, FP exponent and mantissa widths are independently varied for the seven major computational blocks of an airborne synthetic aperture radar (SAR) image formation engine that uses the backprojection algorithm. SAR imaging uses pulses of microwave energy to provide day, night, and all-weather imaging and can be used for reconnaissance, navigation, and environment monitoring. The backprojection algorithm is a frequently used tomographic reconstruction method similar to that used in computed tomography (CT) imaging. Additionally, trigonometric function evaluation, interpolation, and Fourier transforms are common to SAR backprojection and other biomedical image formation algorithms. The circuit area in 65 nm CMOS, the peak signal-to-noise ratio (PSNR), and the structural similarity index metric (SSIM) are found for 572 design points. With word width reductions of 46.9–79.7%, images with a 0.99 SSIM are created with imperceptible image quality degradation and a 1.9–11.4x area reduction.

The third portion of this dissertation covers the physical design of two many-core chips in 32 nm PD-SOI, KiloCore and KiloCore2. The first part of this portion covers the design of KiloCore, while the second part details the adjustments made to the flow for the tape-out of KiloCore2. KiloCore features 1000 cores capable of independent program execution. The maximum clock frequency for the cores on KiloCore ranges from 1.70 GHz to 1.87 GHz at 1.10 V. KiloCore compares favorably against many other many-core and multi-core chips, as well as low-power processors. At a supply voltage of 0.56 V, processors require 5.8 pJ per operation at a clock frequency of 115 MHz. KiloCore2 has 700 cores, 697 of which are programmable processor tiles and three of which are hardware accelerators (a fast Fourier transform (FFT) accelerator and two Viterbi decoders). The assembled printed circuit boards (PCBs) with packaged KiloCore2 chips are expected to be ready in July.

The fourth portion of this dissertation explores implementing a scientific kernel, namely sparse matrix-vector multiplication, on a many-core array. Twenty-three functionally equivalent sparse matrix times dense vector multiplication implementations are created for a fine-grained many-core platform with FP capabilities. These implementations are compared against two central processing unit (CPU) chips and two graphics processing unit (GPU) chips. The designs for the many-core array, CPUs, and GPUs are evaluated using the metrics of throughput per area and throughput per watt when operating on a set of five unstructured sparse matrices of varying dimensions, sourced from a wide range of domains including directed weighted graphs, computational fluid dynamics, circuit simulation, thermal problems (e.g., heat exchanger design), and eigenvalue/model reduction problems. Results using unscheduled and unoptimized code demonstrate that the implementations on the many-core platform increase power efficiency by up to 14.0x versus the CPU implementations and by up to 27.9x versus the GPU implementations. Additionally, the implementations on the many-core platform increase area efficiency by as much as 17.8x versus the CPU implementations and by up to 36.6x versus the GPU implementations.
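
For readers unfamiliar with the kernel studied in the fourth portion, the following is a minimal Python sketch of sparse matrix times dense vector multiplication. It assumes the compressed sparse row (CSR) storage format, which the abstract does not specify, and the names `spmv_csr`, `values`, `col_idx`, and `row_ptr` are illustrative only. It is not one of the dissertation's twenty-three implementations.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    x       : dense input vector
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of this row contribute to the dot product.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 3x3 example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = spmv_csr(values, col_idx, row_ptr, x)  # y == [7.0, 4.0, 18.0]
```

Each row costs one multiply-add per stored nonzero, and the access to `x` is indirect through `col_idx`, which is why unstructured matrices tend to stress memory access patterns as much as FP throughput.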
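
The word-width study in the second portion of the abstract can be illustrated with a small Python sketch that narrows the mantissa of an IEEE 754 binary32 value and measures the resulting PSNR. This is a rough illustration under stated assumptions: it truncates toward zero, it varies only the mantissa width (the dissertation also varies exponent width), and the helper names `truncate_mantissa` and `psnr` and the sine test signal are hypothetical rather than taken from the dissertation.

```python
import math
import struct


def truncate_mantissa(x, mantissa_bits):
    """Keep only `mantissa_bits` fraction bits of x viewed as binary32.

    Assumption: simple truncation toward zero; sign and exponent are untouched.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("binary32 has 23 explicit fraction bits")
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits
    mask = (0xFFFFFFFF >> drop) << drop  # clear the low `drop` bits
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


def psnr(reference, test, peak):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# Hypothetical test signal standing in for image data.
signal = [math.sin(0.1 * i) for i in range(100)]
for bits in (16, 8):
    approx = [truncate_mantissa(v, bits) for v in signal]
    print(bits, "fraction bits -> PSNR", round(psnr(signal, approx, peak=1.0), 1), "dB")
```

Narrower mantissas lower the PSNR; a full study of the kind the abstract describes would sweep both exponent and mantissa widths per computational block and pair each quality measurement with the synthesized circuit area.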

Bibliographic record

  • Author: Pimentel, Jon
  • Author affiliation: University of California, Davis
  • Degree-granting institution: University of California, Davis
  • Subjects: Computer engineering; Electrical engineering; Computer science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 186
  • Original format: PDF
  • Language of text: English