Journal: ACM Transactions on Reconfigurable Technology and Systems

FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods

Abstract

For scientific numerical simulations that require a relatively high ratio of data access to computation, the scalability of memory bandwidth is the key to performance improvement, and custom-computing machines (CCMs) are therefore a promising approach to providing bandwidth-aware structures tailored to individual applications. In this article, we propose a scalable FPGA-array with a bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of the SCMA scales both memory bandwidth and arithmetic performance with the array size, we introduce a homogeneous partitioning approach so that the SCMA is extensible over a 1D or 2D array of FPGAs connected by a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose a BRM based on time-division multiplexing. The BRM decreases the number of communication channels required between adjacent FPGAs at the cost of delay cycles. We formulate the trade-off between the bandwidth and the delay of inter-FPGA data transfer with the BRM. To demonstrate feasibility and evaluate performance quantitatively, we design and implement an SCMA of 192 processing elements over two ALTERA Stratix II FPGAs. The implemented SCMA, running at 106 MHz, has a peak performance of 40.7 GFlops in single precision. We demonstrate that the SCMA achieves sustained performance of 32.8 to 35.7 GFlops for three benchmark computations with high utilization of the computing units. The SCMA scales completely with an increasing number of FPGAs because its computation and communication are highly localized. In addition, we demonstrate that the FPGA-based SCMA is power-efficient: it consumes 69% to 87% of the power and requires only 2.8% to 7.0% of the energy that a 3.4-GHz Pentium4 processor needs for the same computations. With software simulation, we show that the BRM works effectively for the benchmark computations, and therefore commercially available low-end FPGAs with relatively narrow I/O bandwidth can be used to construct a scalable FPGA-array.
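The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the article. It assumes that each processing element completes two floating-point operations per cycle (for example one multiply and one add), which the abstract does not state but which is consistent with the reported peak. It also uses energy = power × time to bound the speedup over the Pentium4. Because the abstract does not pair the power and energy figures per benchmark, only a loose range can be derived. All function and constant names are illustrative.

```python
# Figures quoted in the abstract.
NUM_PES = 192                   # processing elements over two FPGAs
CLOCK_HZ = 106e6                # 106 MHz
SUSTAINED_GFLOPS = (32.8, 35.7) # range over three benchmarks
POWER_RATIO = (0.69, 0.87)      # SCMA power / Pentium4 power
ENERGY_RATIO = (0.028, 0.070)   # SCMA energy / Pentium4 energy

# ASSUMPTION (not stated in the abstract): each PE completes
# 2 floating-point operations per clock cycle.
FLOPS_PER_PE_PER_CYCLE = 2


def peak_gflops(num_pes, clock_hz, flops_per_cycle):
    """Peak rate in GFlops if every PE is busy on every cycle."""
    return num_pes * clock_hz * flops_per_cycle / 1e9


def utilization(sustained_gflops, peak):
    """Fraction of the peak rate actually sustained."""
    return sustained_gflops / peak


def speedup_bounds(power_ratio, energy_ratio):
    """Loose bounds on the speedup over the reference processor.

    energy = power * time, so
    time_ratio = energy_ratio / power_ratio and
    speedup = power_ratio / energy_ratio.
    The extremes are combined because the abstract does not say
    which power figure belongs to which energy figure.
    """
    lowest = min(power_ratio) / max(energy_ratio)
    highest = max(power_ratio) / min(energy_ratio)
    return lowest, highest


if __name__ == "__main__":
    peak = peak_gflops(NUM_PES, CLOCK_HZ, FLOPS_PER_PE_PER_CYCLE)
    print(f"peak under the assumed rate: {peak:.3f} GFlops")
    for s in SUSTAINED_GFLOPS:
        print(f"utilization at {s} GFlops: {utilization(s, peak):.1%}")
    lo, hi = speedup_bounds(POWER_RATIO, ENERGY_RATIO)
    print(f"implied speedup range: {lo:.1f}x to {hi:.1f}x")
```

Under the assumed rate, 192 × 106 MHz × 2 gives 40.704 GFlops, which rounds to the reported 40.7 GFlops. The sustained figures then correspond to roughly 80% to 88% of peak.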

