Optimizations of Two Compute-bound Scientific Kernels on the SW26010 Many-core Processor



Abstract

The home-grown SW26010 many-core processor enabled the production of the Sunway TaihuLight, China's first independently developed supercomputer to rank number one. Its limited off-chip memory bandwidth, however, makes the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication" (RLC), which shares register data among its 8 × 8 computing processing elements (CPEs) via a 2D on-chip network. Such a radical architecture has raised concerns among researchers worldwide about the programming challenges it may cause. To address these concerns, we adopted two compute-bound scientific kernels as benchmarks to identify the potential programming challenges. The first kernel is double-precision general matrix multiplication (DGEMM). An RLC-friendly algorithm was designed for this kernel to reuse the data that already reside in the registers of the 64 CPEs. This novel optimization enables the kernel to achieve up to 88.7% efficiency in one core group of the SW26010. This paper reveals, for the first time, the details of how the highly efficient DGEMM is implemented on the home-grown processor. The second kernel is N-body. Because the SW26010 has inefficient hardware support for transcendental operations, we replaced the reciprocal square root (rsqrt) instruction of N-body with a software routine. Based on the programming challenges identified through these two optimized kernels, we propose a three-level programming guideline for the SW26010. The paper concludes with our key finding: the critical step towards bridging the ninja performance gap on the SW26010 is to design an RLC-friendly algorithm that increases arithmetic intensity.
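
The abstract does not describe the software routine that replaces the rsqrt instruction, so the following is only a minimal sketch of one common approach, assuming a Newton-Raphson refinement of a crude initial guess. The function names `rsqrt_newton` and `inverse_distance_cubed`, the linear initial guess, and the iteration count are all illustrative assumptions, not the authors' implementation.

```python
import math


def rsqrt_newton(x: float, iterations: int = 5) -> float:
    """Approximate 1/sqrt(x) for finite x > 0 without a sqrt or divide in the loop."""
    if not (x > 0.0) or math.isinf(x):
        raise ValueError("x must be a positive finite number")
    # Split x = m * 2**e with m in [0.5, 1), then force e to be even so that
    # 2**(-e/2) is an exact power of two; m then lies in [0.5, 2).
    m, e = math.frexp(x)
    if e % 2:
        m *= 2.0
        e -= 1
    # Crude linear initial guess for 1/sqrt(m) on [0.5, 2) (assumed for illustration).
    y = 1.5 - 0.45 * m
    # Newton-Raphson on f(y) = 1/y**2 - m gives y <- y * (1.5 - 0.5 * m * y * y).
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)
    # Undo the scaling: 1/sqrt(x) = (1/sqrt(m)) * 2**(-e/2).
    return math.ldexp(y, -(e // 2))


def inverse_distance_cubed(dx: float, dy: float, dz: float,
                           softening: float = 1e-9) -> float:
    """Hypothetical N-body helper: 1/r**3 for a softened pairwise distance."""
    r2 = dx * dx + dy * dy + dz * dz + softening * softening
    inv_r = rsqrt_newton(r2)
    return inv_r * inv_r * inv_r
```

Each iteration uses only multiplications and one subtraction, which is why this style of routine is a plausible substitute when a hardware rsqrt is slow; the number of iterations trades accuracy against arithmetic cost.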


Bibliographic record

Source
International Conference on Parallel Processing | 2017 | 603p | 10 pages in total
Authors
James Lin; Zhigeng Xu; Akira Nukada; Naoya Maruyama; Satoshi Matsuoka
Original format: PDF
Chinese Library Classification: TP311.133.2-53
Keywords
TaihuLight; SW26010; DGEMM; N-body; Register level communication; Performance optimization;


Similar literature


1. Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations [J]. James Lin, Zhigeng Xu, Linjin Cai, Parallel Computing. 2018, issue SEPa
2. Efficient parallelization of multilevel fast multipole algorithm for electromagnetic simulation on many-core SW26010 processor [J]. He Wei-Jia, Yang Ming-Lin, Wang Wu, Journal of Supercomputing. 2021, issue 2
3. UNAT: UNstructured Acceleration Toolkit on SW26010 many-core processor [J]. Liu Hongbin, Ren Hu, Gu Hanfeng, Engineering Computations. 2020, issue 9
4. Optimizations of Two Compute-bound Scientific Kernels on the SW26010 Many-core Processor [C]. James Lin, Zhigeng Xu, Akira Nukada, International Conference on Parallel Processing. 2017
5. On implementation and optimization of large-data scientific kernels on multicore processors and GPUs [D]. Hakeem, Mohammad Umar. 2013
6. Generation and optimization of superpixels as image processing kernels for Jones matrix optical coherence tomography [O]. Arata Miyazawa, Young-Joo Hong, Shuichi Makita, 2017
7. Parallel Implementation and Optimization of Regional Ocean Modeling System (ROMS) Based on Sunway SW26010 Many-Core Processor [O]. Tao Liu, Yuan Zhuang, Min Tian, 2019
