Computer Architecture News

Locality-Aware CTA Clustering for Modern GPUs


Abstract

Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 cache with its long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling so that CTAs with potential reuse are grouped together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures. The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse.
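Since the abstract describes CTA-Clustering as purely software-based, one way to picture the idea is a CUDA sketch in which each physical CTA queries the SM it was scheduled on (via the real %smid PTX special register) and then drains a per-SM queue of logical CTA indices, so that a cluster of logical CTAs assumed to reuse each other's data all execute behind the same SM's L1/Tex cache. This is a minimal illustration under assumed names (clustered_kernel, cluster_next, LOGICAL_PER_SM), not the authors' actual framework:

```cuda
// Sketch of software CTA clustering: bind clusters of logical CTAs to SMs.
// Assumption: logical CTAs within one cluster reuse each other's data.
#include <cstdio>
#include <cuda_runtime.h>

#define LOGICAL_PER_SM 64   // logical CTAs bound to each SM (illustrative)
#define THREADS 128

__device__ unsigned int smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));  // id of the SM we run on
    return id;
}

__global__ void clustered_kernel(int *cluster_next, float *data) {
    unsigned int sm = smid();
    for (;;) {
        // Pull the next logical CTA from this SM's private work queue.
        int slot = atomicAdd(&cluster_next[sm], 1);
        if (slot >= LOGICAL_PER_SM) return;        // this SM's cluster is done
        int logical_cta = sm * LOGICAL_PER_SM + slot;  // cluster -> SM binding
        // Original kernel body goes here, indexed by logical_cta
        // instead of blockIdx.x so reuse stays within one L1.
        data[logical_cta * THREADS + threadIdx.x] += 1.0f;
    }
}

int main() {
    int num_sms;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    int *cluster_next;
    float *data;
    size_t n = (size_t)num_sms * LOGICAL_PER_SM * THREADS;
    cudaMalloc(&cluster_next, num_sms * sizeof(int));
    cudaMemset(cluster_next, 0, num_sms * sizeof(int));
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // Launch enough physical CTAs that every SM is likely to receive
    // several; each one then drains its own SM's queue.
    clustered_kernel<<<num_sms * 4, THREADS>>>(cluster_next, data);
    cudaDeviceSynchronize();
    printf("done on %d SMs\n", num_sms);
    return 0;
}
```

A production scheme would additionally rebalance work if some SM happens to receive no physical CTA, and would derive the cluster-to-SM mapping from the application's actual sharing pattern; the paper folds such decisions into its automatic optimization framework.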