Conference: International Symposium on Microarchitecture

Inter-Thread Communication in Multithreaded, Reconfigurable Coarse-Grain Arrays

Abstract

Traditional von Neumann GPGPUs only allow threads to communicate through memory on a group-to-group basis. In this model, a group of producer threads writes intermediate values to memory, and a group of consumer threads reads them after a barrier synchronization. To relieve the memory bandwidth pressure imposed by this method of communication, GPGPUs provide a small scratchpad memory that keeps intermediate values from overloading DRAM bandwidth. In this paper, we introduce direct inter-thread communication for massively multithreaded CGRAs, where intermediate values are communicated directly through the compute fabric on a point-to-point basis. This method avoids the need to write values to memory, eliminates the need for a dedicated scratchpad, and avoids workgroup-global barriers. We introduce our proposed extensions to the programming model (CUDA) and execution model, as well as the hardware primitives that facilitate the communication. Our simulations of Rodinia benchmarks running on the new system show that direct inter-thread communication provides an average speedup of 2.8x (10.3x max) and reduces system power by an average of 5x (22x max) when compared to an equivalent Nvidia GPGPU.
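
For readers unfamiliar with the baseline the abstract argues against, the following is a minimal sketch of group-to-group communication in standard CUDA: every thread in a block produces an intermediate value, publishes it in the scratchpad (`__shared__` memory), waits at a block-wide barrier, and only then reads a neighbour's value. The kernel name `neighbour_sum`, the helper `run_neighbour_sum`, the block size, and the arithmetic are all illustrative assumptions, not taken from the paper. The paper's proposed CUDA extensions are not described in this abstract and are not reproduced here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed workgroup (thread block) size

// Conventional scratchpad-plus-barrier pattern (illustrative only).
__global__ void neighbour_sum(const float* in, float* out) {
    __shared__ float tile[kBlockSize];          // on-chip scratchpad
    const int t = threadIdx.x;
    const int g = blockIdx.x * blockDim.x + t;

    // Producer phase: every thread writes its intermediate value to memory.
    tile[t] = 2.0f * in[g];

    // Block-wide barrier: no thread may consume until all have produced.
    __syncthreads();

    // Consumer phase: read the value produced by the next thread in the block.
    const int next = (t + 1) % blockDim.x;
    out[g] = tile[t] + tile[next];
}

// Hypothetical host helper: runs the kernel on the inputs 0, 1, 2, ...
// CUDA error checking is omitted to keep the sketch short.
std::vector<float> run_neighbour_sum(int num_blocks) {
    const int n = num_blocks * kBlockSize;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

    neighbour_sum<<<num_blocks, kBlockSize>>>(d_in, d_out);

    // This blocking copy waits for the kernel on the default stream to finish.
    cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host;
}

int main() {
    const std::vector<float> out = run_neighbour_sum(2);
    std::printf("out[0] = %.1f, out[255] = %.1f\n", out[0], out[255]);
    return 0;
}
```

The sketch compiles with `nvcc` as a single `.cu` file. Each consumer needs a value from only one producer, yet the barrier makes every thread in the block wait for all the others. That is the cost the abstract says point-to-point communication through the compute fabric avoids.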
