IEEE Transactions on Parallel and Distributed Systems

Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods


Abstract

In this article, we study the performance and energy-saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), a popular N-body method. We use a dedicated Application-Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used in applications and is a representative example of the workload of many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in, or near main memory against an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between the accelerator and the CPU. We study the tradeoffs using detailed and accurately calibrated x86 CPU, accelerator, and DRAM simulations. Our results show that simply moving accelerators closer to the chip does not necessarily lead to performance or energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2X over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings. This is chiefly due to the application characteristics, the high idle power, and the effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings of 37 and 35 percent, respectively, be unlocked over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied to an off-chip integration, on-chip integration still delivers up to 20 percent better performance and 17 percent lower total energy consumption than off-chip integration.
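
To make the software blocking and accelerator offload mentioned in the abstract concrete, the following is a minimal, self-contained C sketch; it is not the authors' implementation. The matrix size, tile size, row-major layout, and the accel_gemm_tile stub are all assumptions for illustration: the stub stands in for the GEMM ASIC offload and is implemented here as a plain triple loop so the sketch runs on a CPU. In a real integration this call would be asynchronous, and tile transfers would be double-buffered and overlapped with compute, which is the co-optimization the abstract credits with the additional 37/35 percent gains.

/* Blocked GEMM offload sketch: C += A * B is decomposed into TILE x TILE
 * sub-problems so each tile is moved to the accelerator once and reused,
 * instead of streaming full matrices through DRAM for every partial product. */
#include <stdio.h>
#include <stdlib.h>

#define N    512   /* matrix dimension (square, assumed for brevity) */
#define TILE 64    /* tile edge, sized to fit the accelerator's local buffers */

/* Hypothetical offload: C_tile += A_tile * B_tile on the accelerator.
 * Implemented as a plain loop nest here so the sketch is runnable. */
static void accel_gemm_tile(const double *A, const double *B, double *C,
                            int lda, int ldb, int ldc)
{
    for (int i = 0; i < TILE; i++)
        for (int k = 0; k < TILE; k++)
            for (int j = 0; j < TILE; j++)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

/* Blocked driver: each C tile stays "hot" while the contributing
 * A/B tiles are streamed past it exactly once. */
static void blocked_gemm(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                accel_gemm_tile(&A[ii * N + kk], &B[kk * N + jj],
                                &C[ii * N + jj], N, N, N);
}

int main(void)
{
    double *A = calloc((size_t)N * N, sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) return 1;

    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    blocked_gemm(A, B, C);
    printf("C[0][0] = %f (expected %f)\n", C[0], 2.0 * N);

    free(A); free(B); free(C);
    return 0;
}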
