Conference: 2010 DoD High Performance Computing Modernization Program Users Group Conference

Accelerating a Sparse Matrix Iterative Solver Using a High Performance Reconfigurable Computer

Abstract

High performance reconfigurable computers (HPRCs), which combine general-purpose processors (GPPs) and field programmable gate arrays (FPGAs), are now commercially available. These interesting architectures allow for the creation of reconfigurable processors. HPRCs have already been used to accelerate integer and fixed-point applications. However, extensive parallelism and deeply pipelined floating-point cores are necessary to make MHz-scale FPGAs competitive with GHz-scale GPPs, thus making it difficult to accelerate certain kinds of floating-point kernels. Kernels with variable length nested loops, e.g., sparse matrix-vector multiply, have been problematic because of the loop-carried dependence associated with the pipelined floating-point units. While hardware description language (HDL)-based kernels have shown moderate success in addressing this problem, the use of a high-level language (HLL)-based approach to accelerate such applications has been rather elusive. If HPRCs are to become a part of mainstream military and scientific computing, we should emphasize the use of HLL-based programming, whenever possible, rather than HDL-based hardware design. The primary reason is the increased programmer productivity associated with HLLs when compared with HDLs. For example, the floating-point addition statement z = x+y, a single line in an HLL, corresponds to hundreds of lines of HDL. In this paper, we describe the design and implementation of a sparse matrix Jacobi processor to solve systems of linear equations, Ax=b. The parallelized, deeply pipelined, IEEE-754-compliant 32-bit floating-point sparse matrix Jacobi iterative solver runs on a contemporary HPRC. The FPGA-based components are implemented using only an HLL (the C programming language) and the Carte HLL-to-HDL compiler. 
An HLL-based streaming accumulator allows for the implementation of fully pipelined loops and results in a 2.5-fold wall clock runtime speedup when compared with an equivalent software-only implementation.
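
To make the kernel structure described in the abstract concrete, here is a minimal software-only sketch of a Jacobi iteration for Ax=b in C with 32-bit floats. It is not the paper's Carte/FPGA implementation. The CSR storage layout and the names `csr_matrix`, `jacobi_sweep`, and `jacobi_solve` are assumptions made for illustration, since the abstract does not give the storage format or any interfaces.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical CSR (compressed sparse row) container. The abstract does
   not state the storage format used in the paper; this is an assumption. */
typedef struct {
    size_t n;              /* matrix is n x n */
    const size_t *row_ptr; /* n + 1 offsets into col_idx and val */
    const size_t *col_idx; /* column index of each stored entry */
    const float *val;      /* value of each stored entry */
} csr_matrix;

/* One Jacobi sweep:
     x_new[i] = (b[i] - sum over j != i of a_ij * x_old[j]) / a_ii
   Returns the largest absolute change in any component, or -1.0f if a
   row has a zero or missing diagonal entry. */
float jacobi_sweep(const csr_matrix *a, const float *b,
                   const float *x_old, float *x_new)
{
    float max_delta = 0.0f;
    for (size_t i = 0; i < a->n; ++i) {
        float sum = 0.0f;
        float diag = 0.0f;
        /* Variable-length inner loop: its trip count is the number of
           stored entries in row i. */
        for (size_t k = a->row_ptr[i]; k < a->row_ptr[i + 1]; ++k) {
            size_t j = a->col_idx[k];
            if (j == i) {
                diag = a->val[k];
            } else {
                /* Loop-carried dependence: each addition needs the
                   previous value of sum. */
                sum += a->val[k] * x_old[j];
            }
        }
        if (diag == 0.0f) {
            return -1.0f;
        }
        x_new[i] = (b[i] - sum) / diag;
        float delta = fabsf(x_new[i] - x_old[i]);
        if (delta > max_delta) {
            max_delta = delta;
        }
    }
    return max_delta;
}

/* Repeats sweeps until the largest change drops below tol.
   x holds the initial guess on entry and the result on exit;
   scratch must have room for n floats.
   Returns the number of sweeps used, or -1 if a diagonal entry is
   zero or max_iter is reached without meeting tol. */
int jacobi_solve(const csr_matrix *a, const float *b, float *x,
                 float *scratch, float tol, int max_iter)
{
    for (int it = 1; it <= max_iter; ++it) {
        float delta = jacobi_sweep(a, b, x, scratch);
        if (delta < 0.0f) {
            return -1;
        }
        for (size_t i = 0; i < a->n; ++i) {
            x[i] = scratch[i];
        }
        if (delta < tol) {
            return it;
        }
    }
    return -1;
}
```

The `sum += ...` statement in the inner loop is the loop-carried dependence the abstract refers to. With a deeply pipelined floating-point adder, each addition would have to wait for the previous one to finish, and the per-row trip count varies with the sparsity pattern. The abstract says the paper resolves this with an HLL-based streaming accumulator; the sketch above only shows where the problem arises in a conventional software loop.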
