Journal: ACM Transactions on Reconfigurable Technology and Systems

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices


Abstract

Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven very efficient and robust in software for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. In this article it is shown that through parallelization it is possible to reduce the computation time per iteration for an order n matrix from Θ(n^2) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and to converge to a constant value as the matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation achieves a speedup of at least an order of magnitude.
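
For orientation, the following is a minimal software sketch of the textbook Conjugate Gradient iteration for a dense symmetric positive definite system. It is not the hardware design described in the article; the function names (`dot`, `matvec`, `conjugate_gradient`) and the tolerance and iteration-limit parameters are assumptions made for illustration. It shows that each iteration is dominated by one dense matrix-vector product, which costs n^2 multiply-adds when executed sequentially. One plausible reading of the Θ(n) figure is that roughly n of those multiply-adds are performed per clock cycle, but that reading is an assumption and not a detail given in the abstract.

```python
def dot(u, v):
    """Inner product of two equal-length vectors: n multiply-adds."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product: n row dot products, n^2 multiply-adds."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a dense symmetric positive definite matrix A.

    Textbook CG starting from x = 0. `tol` and `max_iter` are
    illustrative parameters, not values taken from the article.
    """
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    x = [0.0] * n
    r = list(b)              # residual r = b - A x, with x = 0
    p = list(r)              # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm small enough
        Ap = matvec(A, p)    # the Θ(n^2) step when run sequentially
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

As a usage example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` solves a 2x2 system whose exact solution is x = (1/11, 7/11). The remaining per-iteration work (two dot products and three vector updates) is only Θ(n), so the matrix-vector product is the natural target for parallelization.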
