Computer Architecture News

Locality-Aware CTA Clustering for Modern GPUs


Abstract

Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 cache with its long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling so that CTAs with potential reuse are grouped together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures. The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse.
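Since the abstract describes CTA-Clustering as purely software-based, one way to picture the idea is a CUDA sketch in which each physical CTA queries the SM it was scheduled on (via the real %smid PTX special register) and then drains a per-SM queue of logical CTA indices, so that a cluster of logical CTAs assumed to reuse each other's data all execute behind the same SM's L1/Tex cache. This is a minimal illustration under assumed names (clustered_kernel, cluster_next, LOGICAL_PER_SM), not the authors' actual framework:

```cuda
// Sketch of software CTA clustering: bind clusters of logical CTAs to SMs.
// Assumption: logical CTAs within one cluster reuse each other's data.
#include <cstdio>
#include <cuda_runtime.h>

#define LOGICAL_PER_SM 64   // logical CTAs bound to each SM (illustrative)
#define THREADS 128

__device__ unsigned int smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));  // id of the SM we run on
    return id;
}

__global__ void clustered_kernel(int *cluster_next, float *data) {
    unsigned int sm = smid();
    for (;;) {
        // Pull the next logical CTA from this SM's private work queue.
        int slot = atomicAdd(&cluster_next[sm], 1);
        if (slot >= LOGICAL_PER_SM) return;        // this SM's cluster is done
        int logical_cta = sm * LOGICAL_PER_SM + slot;  // cluster -> SM binding
        // Original kernel body goes here, indexed by logical_cta
        // instead of blockIdx.x so reuse stays within one L1.
        data[logical_cta * THREADS + threadIdx.x] += 1.0f;
    }
}

int main() {
    int num_sms;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    int *cluster_next;
    float *data;
    size_t n = (size_t)num_sms * LOGICAL_PER_SM * THREADS;
    cudaMalloc(&cluster_next, num_sms * sizeof(int));
    cudaMemset(cluster_next, 0, num_sms * sizeof(int));
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // Launch enough physical CTAs that every SM is likely to receive
    // several; each one then drains its own SM's queue.
    clustered_kernel<<<num_sms * 4, THREADS>>>(cluster_next, data);
    cudaDeviceSynchronize();
    printf("done on %d SMs\n", num_sms);
    return 0;
}
```

A production scheme would additionally rebalance work if some SM happens to receive no physical CTA, and would derive the cluster-to-SM mapping from the application's actual sharing pattern; the paper folds such decisions into its automatic optimization framework.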