24th ACM International Conference on Supercomputing (ICS 2010)

FPGA Accelerating Double/Quad-Double High Precision Floating-Point Applications for ExaScale Computing


Abstract

In this paper we explore the capability and flexibility of FPGA solutions for accelerating scientific computing applications that require very high precision arithmetic, based on 128-bit or even 256-bit floating-point number representations.

This paper addresses accuracy when performing LU decomposition on large-scale matrices. In future ExaScale computing environments, accuracy errors are expected to grow to a level that leaves only 11 significant bits in the mantissa. This is caused by the large number of accumulation operations required, which is on the order of O(n^3). Using exact long fixed-point numbers instead of the usual floating-point numbers in the accumulation process leads to exact accumulation results with only one bit of error, originating from the rounding in the final normalization step. We have developed two types of High Precision Multiplication and Accumulation (HP-MAC) units, for Double-Double (128-bit) and Quad-Double (256-bit) floating-point respectively, and implemented them in FPGA devices. We propose a two-level RAM bank scheme to store and add long fixed-point numbers with minimized critical data path lengths. We also introduce a partial-summation scheme to enhance the pipeline throughput of MAC operations, by dividing the summation into 4 partial operations processed in 4 banks. To prove the concept, we prototyped six 128-bit HP-MAC units in a Xilinx Virtex-5 XC5VLX330 FPGA chip and performed LU decomposition.

The experimental results show an accuracy improvement of 10 to 24 bits compared to a software approach with similar precision arithmetic. Moreover, our LU decomposition implementation, based on an FPGA running at 133 MHz, achieves 29X-56X better performance and much lower power consumption than a software-based library running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
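To make the Double-Double (128-bit) precision idea concrete, below is a minimal software sketch of a double-double multiply-accumulate built from the classic error-free transformations (Knuth two-sum and an FMA-based two-product). All names here (dd_t, two_sum, two_prod, dd_mac) are illustrative and not taken from the paper; the paper's HP-MAC is a hardware unit that additionally accumulates into a long fixed-point format, while this sketch only shows the software arithmetic it accelerates.

    // Minimal double-double (~128-bit) multiply-accumulate sketch.
    #include <cmath>
    #include <cstdio>

    struct dd_t { double hi, lo; };            // value = hi + lo, |lo| << |hi|

    // Error-free addition: s + e == a + b exactly (Knuth two-sum).
    static void two_sum(double a, double b, double &s, double &e) {
        s = a + b;
        double t = s - a;
        e = (a - (s - t)) + (b - t);
    }

    // Error-free multiplication: p + e == a * b exactly (uses fused multiply-add).
    static void two_prod(double a, double b, double &p, double &e) {
        p = a * b;
        e = std::fma(a, b, -p);
    }

    // acc += x * y, keeping roughly 2x double precision.
    static dd_t dd_mac(dd_t acc, double x, double y) {
        double p, pe, s, se;
        two_prod(x, y, p, pe);
        two_sum(acc.hi, p, s, se);
        double lo = acc.lo + pe + se;          // gather all low-order parts
        dd_t r;
        two_sum(s, lo, r.hi, r.lo);            // renormalize the pair
        return r;
    }

    int main() {
        dd_t acc{0.0, 0.0};
        for (int i = 0; i < 1000000; ++i)
            acc = dd_mac(acc, 1e-3, 1e-3);     // exact result would be 1.0
        std::printf("%.17g + %.17g\n", acc.hi, acc.lo);
    }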
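The partial-summation scheme mentioned in the abstract can also be illustrated in software. The sketch below distributes the products of a dot product round-robin over 4 independent partial sums, standing in for the paper's 4 RAM banks: breaking the single accumulation dependency chain is what lets a pipelined adder accept a new operand every cycle, and the 4 partial results are reduced once at the end. Plain double is used here only to keep the sketch short; the paper accumulates into a long fixed-point format and performs the banking in hardware, so this is a conceptual analogy, not the paper's implementation.

    // Partial summation over 4 "banks": a software analogy of the pipelining idea.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    double dot_partial4(const std::vector<double> &x, const std::vector<double> &y) {
        double part[4] = {0.0, 0.0, 0.0, 0.0};
        for (std::size_t i = 0; i < x.size(); ++i)
            part[i % 4] += x[i] * y[i];        // bank i%4 receives this product
        // Final reduction of the 4 banks, done once, off the critical loop.
        return (part[0] + part[1]) + (part[2] + part[3]);
    }

    int main() {
        std::vector<double> x(1000, 0.001), y(1000, 1.0);
        std::printf("%.17g\n", dot_partial4(x, y));
    }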

