Conference: International Conference on Parallel Processing

Optimizations of Two Compute-bound Scientific Kernels on the SW26010 Many-core Processor

Abstract

The home-grown SW26010 many-core processor enabled the production of China's first independently developed number-one ranked supercomputer - the Sunway TaihuLight. The design of the limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication" (RLC), to share register data among its 8 × 8 computing processing elements (CPEs) via a 2D on-chip network. Such a radical architecture has sparked global researchers' concerns regarding the programming challenges this may cause. To address these concerns, we adopted two compute-bound scientific kernels as benchmarks to identify the potential programming challenges. The first kernel is double-precision general matrix-multiplication (DGEMM). An RLC-friendly algorithm was designed for this kernel to reuse the data that already reside in the registers of 64 CPEs. This novel optimization enables the kernel to achieve up to 88.7% efficiency in one core group of the SW26010. This paper reveals, for the first time, the details of how the highly efficient DGEMM is implemented on the home-grown processor. The second kernel that we used is N-body. Due to the inefficient hardware support for transcendental operations on the SW26010, we replaced the reciprocal square root (rsqrt) instruction of N-body with a software routine to tackle the problem. Based on the programming challenges identified through these two optimized kernels, we proposed a three-level programming guideline for the SW26010. The paper concludes with our crucial finding that the critical step towards bridging the ninja performance gap on the SW26010 is to design an RLC-friendly algorithm to increase arithmetic intensity.
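The abstract does not say which software routine replaced the rsqrt instruction. As an illustration only, the sketch below shows one common way to compute a reciprocal square root in software: a bit-level initial guess refined by Newton-Raphson iterations. The function name `soft_rsqrt`, the magic constant, and the iteration count are assumptions made for this example and are not taken from the paper.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the paper's routine): approximates 1/sqrt(x)
 * for positive, finite x in the normal double range using only integer
 * operations, multiplies, and subtracts. */
static double soft_rsqrt(double x)
{
    assert(x > 0.0 && isfinite(x));

    /* Initial guess: reinterpret the bits, halve and negate the exponent
     * with a well-known magic constant for 64-bit doubles. This guess is
     * accurate to within a few percent. */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5FE6EB50C7B537A9ULL - (bits >> 1);

    double y;
    memcpy(&y, &bits, sizeof y);

    /* Newton-Raphson refinement for f(y) = 1/y^2 - x. Each step roughly
     * squares the relative error, so four steps bring a few-percent guess
     * close to double precision. */
    for (int i = 0; i < 4; ++i) {
        y = y * (1.5 - 0.5 * x * y * y);
    }
    return y;
}
```

In an N-body inner loop, a routine like this would be called on the squared distance between two particles (plus a softening term) to obtain the inverse distance without a hardware square root or division. The trade-off is extra multiply-add work per pair in exchange for avoiding a slow transcendental instruction.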