Journal: WSEAS Transactions on Computers

FPGA-Based Hardware Acceleration on I/O-Bound Scientific Applications


Abstract

Reconfigurable computing using Field Programmable Gate Arrays (FPGAs) is known for its superiority in floating-point arithmetic. Scientific applications and long-running server applications typically execute a huge number of loop iterations, with or without loop dependencies, over a set of floating-point data that may not fit in memory as a whole. It is therefore unavoidable to load part of the data into memory and compute over that part. While the data is being accessed the CPU stalls, and the process is said to be I/O-bound. With the widening latency gap between CPU and memory, these computation patterns point to two design dimensions: data dependence elimination and memory latency hiding. In this paper, an FPGA-based computation model is proposed that exploits the flexibility of the FPGA-based reconfigurable architecture to improve the performance of applications that require extensive memory accesses in a loop. First, the application is parallelized using a blocking algorithm, and each resulting block is loaded into the FPGA on-chip memory. The blocks are then processed in parallel, so that memory latency is hidden by overlapping it with computation. Performance evaluation results for representative I/O-bound applications are reported; the speedup reaches up to 21.8 when running a parallelized blocking QR matrix decomposition algorithm on the proposed computation model.
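
The latency-hiding idea in the abstract (load one block while computing on another) can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not the paper's FPGA design: `load_block`, `compute_block`, and `blocked_sum_of_squares` are hypothetical names, a Python list stands in for off-chip data, and a sum of squares stands in for the real per-block kernel.

```python
from concurrent.futures import ThreadPoolExecutor


def load_block(data, start, size):
    """Hypothetical stand-in for a slow fetch from off-chip memory or disk."""
    return data[start:start + size]


def compute_block(block):
    """Hypothetical stand-in for the per-block floating-point kernel."""
    return sum(x * x for x in block)


def blocked_sum_of_squares(data, block_size):
    """Process data block by block, prefetching the next block
    in a background thread while the current one is computed."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = None  # future for the block that is currently loading
        for start in range(0, len(data), block_size):
            # Start loading the next block before computing on the previous one.
            upcoming = loader.submit(load_block, data, start, block_size)
            if pending is not None:
                total += compute_block(pending.result())
            pending = upcoming
        # Drain the last block, if any.
        if pending is not None:
            total += compute_block(pending.result())
    return total


if __name__ == "__main__":
    print(blocked_sum_of_squares([1.0, 2.0, 3.0, 4.0, 5.0], block_size=2))
```

In this sketch the loader thread plays the role of the memory interface and the main thread plays the role of the compute units. On real hardware, the benefit depends on the load time and compute time per block being comparable.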
