ACM International Conference on Supercomputing

FPGA Accelerating Double/Quad-Double High Precision Floating-Point Applications for ExaScale Computing



Abstract

In this paper we explore the capability and flexibility of FPGA solutions for accelerating scientific computing applications that require very high precision arithmetic, based on 128-bit or even 256-bit floating-point number representations. The paper addresses the accuracy of LU decomposition on large-scale matrices. In future ExaScale computing environments, accumulated rounding errors are expected to grow to a level that leaves only 11 significant bits in the mantissa. This is caused by the large number of accumulation operations required, which is on the order of O(n³). Using exact long fixed-point numbers instead of the usual floating-point numbers in the accumulation process leads to exact accumulation results with at most one bit of error, originating from the rounding in the final normalization step. We have developed two types of High Precision Multiplication and Accumulation (HP-MAC) units, for Double-Double (128-bit) and Quad-Double (256-bit) floating-point respectively, and implemented them on FPGA devices. We propose a two-level RAM-bank scheme to store and add long fixed-point numbers with minimized critical data-path lengths. We also introduce a partial-summation scheme that enhances the pipeline throughput of MAC operations by splitting the summation into four partial operations processed in four banks. To prove the concept, we prototyped six 128-bit HP-MAC units on a Xilinx Virtex-5 XC5VLX330 FPGA chip and performed LU decomposition. The experimental results show an accuracy improvement of 10 to 24 bits compared to a software approach with similar-precision arithmetic. Moreover, our LU decomposition implementation, running on the FPGA at 133 MHz, achieves 29x-56x better performance and much lower power consumption than a software-based library running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
