Orchestrated Scheduling and Prefetching for GPGPUs

Published in: Computer Architecture News

Abstract

In this paper, we present techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies. We demonstrate that existing warp scheduling policies in GPGPU architectures are unable to effectively incorporate data prefetching. The main reason is that they schedule consecutive warps, which are likely to access nearby cache blocks and thus prefetch accurately for one another, back-to-back in consecutive cycles. This either 1) causes prefetches to be generated by a warp too close to the time their corresponding addresses are actually demanded by another warp, or 2) requires sophisticated prefetcher designs to correctly predict the addresses required by a future "far-ahead" warp while executing the current warp. We propose a new prefetch-aware warp scheduling policy that overcomes these problems. The key idea is to separate in time the scheduling of consecutive warps such that they are not executed back-to-back. We show that this policy not only enables a simple prefetcher to be effective in tolerating memory latencies but also improves memory bank parallelism, even when prefetching is not employed. Experimental evaluations across a diverse set of applications on a 30-core simulated GPGPU platform demonstrate that the prefetch-aware warp scheduler provides 25% and 7% average performance improvement over baselines that employ prefetching in conjunction with, respectively, the commonly-employed round-robin scheduler or the recently-proposed two-level warp scheduler. Moreover, when prefetching is not employed, the prefetch-aware warp scheduler provides higher performance than both of these baseline schedulers as it better exploits memory bank parallelism.
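The key idea above — separating consecutive warps in time so that one warp's prefetches have enough lead time before a neighboring warp demands the same addresses — can be illustrated with a minimal grouping sketch. This is our own toy model, not the paper's scheduler or simulator: the function names and the modulo-based group assignment are assumptions made purely to contrast "neighbors back-to-back" with "neighbors separated in time".

```python
def consecutive_groups(num_warps, group_size):
    """Baseline-style grouping: consecutive warps land in the same
    scheduling group, so warps that access nearby cache blocks run
    back-to-back and prefetches arrive with little lead time."""
    return [list(range(i, i + group_size))
            for i in range(0, num_warps, group_size)]

def prefetch_aware_groups(num_warps, num_groups):
    """Prefetch-aware-style grouping (illustrative): warp w is assigned
    to group w % num_groups, so consecutive warps fall into different
    groups and execute far apart in time. A simple prefetcher triggered
    by one group can then cover the demands of a later group."""
    groups = [[] for _ in range(num_groups)]
    for w in range(num_warps):
        groups[w % num_groups].append(w)
    return groups

# With 8 warps split two ways:
print(consecutive_groups(8, 4))     # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(prefetch_aware_groups(8, 2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

In the second layout, warps 0 and 1 — likely to prefetch accurately for one another — are scheduled in different groups, giving warp 0's prefetches time to complete before warp 1 issues its demands; as a side effect, each group spreads its accesses across more memory banks, consistent with the bank-parallelism benefit the abstract reports even without prefetching.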
