Locality based warp scheduling in GPGPUs

Yang Zhang; Zuocheng Xing; Cang Liu; Chuan Tang; Qinglin Wang


Abstract

As demand for high-performance computing grows, designing massively multi-core processors with high throughput and efficiency becomes increasingly urgent. As core counts keep rising, however, on-chip memory capacity does not keep pace. In a multi-core processor such as a GPGPU (General-Purpose Graphics Processing Unit), dozens or hundreds of SMs (Streaming Multiprocessors) cooperate to achieve high throughput with only several MB of on-chip memory. Within one SM, thousands of threads are organized into thread blocks and execute instructions in SIMT (Single Instruction, Multiple Threads) fashion. Because all threads share the same on-chip memory, the mismatch between the large number of cores and the small on-chip memory capacity can easily degrade performance through excessive thread contention for cache resources.

An efficient thread scheduling method is a promising way to alleviate these problems and boost performance. From the hardware perspective, instructions are executed by warps, each made up of a fixed number of threads. We therefore propose a novel warp scheduling scheme that maintains data locality and relieves cache pollution and thrashing. First, to exploit temporal locality, we place the unordered warps into a supervised warp queue and issue them from oldest to youngest. Second, to exploit spatial locality and hide computation-unit stalls, we introduce a new insertion method called LPI (Locality Protected Insertion), which reorders warps in the supervised warp queue so that long-latency warps are better hidden by short-latency warps, such as those performing ALU operations and on-chip accesses. Across a wide variety of applications, the new scheduling method achieves improvements of up to 10.1%, and 2.2% on average, over the baseline loose round-robin scheduling.
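To make the contrast between the two issue orders concrete, here is a minimal Python sketch, not the authors' implementation. It compares a loose round-robin pick (the baseline named in the abstract) with an oldest-first pick from a queue of warps. The `Warp` class, its fields, and both function names are hypothetical, introduced only for illustration. The LPI reordering step is not modeled, because the abstract does not give its details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    wid: int            # warp id; assumed equal to its index in the list
    age: int            # lower value = older warp (hypothetical field)
    ready: bool = True  # False while the warp waits on a long-latency operation


def pick_loose_round_robin(warps: List[Warp], last_issued: int) -> Optional[int]:
    """Baseline: start after the last issued warp and take the first ready one."""
    n = len(warps)
    for step in range(1, n + 1):
        cand = warps[(last_issued + step) % n]
        if cand.ready:
            return cand.wid
    return None  # every warp is stalled


def pick_oldest_first(warps: List[Warp]) -> Optional[int]:
    """Oldest-to-youngest issue: among ready warps, take the oldest."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return min(ready, key=lambda w: w.age).wid


# Small example: warp 1 is the oldest but is stalled.
warps = [
    Warp(wid=0, age=2),
    Warp(wid=1, age=0, ready=False),
    Warp(wid=2, age=1),
    Warp(wid=3, age=3),
]
rr_choice = pick_loose_round_robin(warps, last_issued=2)
oldest_choice = pick_oldest_first(warps)
```

In this example the round-robin pick moves on to the next warp in cyclic order (warp 3), while the oldest-first pick returns to warp 2, the oldest warp that is ready. Repeatedly favoring older warps is one way a scheduler can keep a smaller set of warps active in the cache, which is the temporal-locality intuition the abstract describes.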


Bibliographic details

Source
Future Generation Computer Systems | 2018, Issue 5 | pp. 520-527 | 8 pages
Authors
Yang Zhang; Zuocheng Xing; Cang Liu; Chuan Tang; Qinglin Wang
Author affiliations

Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology (listed for each of the five authors)
Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Keywords
GPGPU; Warp scheduling; Locality; Reordering


