Source: 37th Annual International Symposium on Computer Architecture (ISCA), 2010

Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance



Abstract

SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) is stalled due to long-latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth × SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should first be exploited. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles and reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit in the cache to run ahead. The result is improved latency hiding and memory-level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show our technique improves performance on average by 1.7X.
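
The idea of letting one warp occupy several scheduler slots without extra register space can be illustrated with a toy model. The Python sketch below is not the paper's hardware design: the names `WarpSplit`, `subdivide`, and `register_file_entries`, the 8-lane width, and the program-counter values are all assumptions made for illustration. It only shows that a warp's active mask can be split into disjoint lane sets that are scheduled separately, while the register file size depends on the number of warps and the SIMD width, not on the number of splits.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

SIMD_WIDTH = 8  # illustrative lane count, not taken from the paper


@dataclass(frozen=True)
class WarpSplit:
    """Hypothetical scheduling entity: a warp id, a PC, and the lanes it covers."""
    warp_id: int
    pc: int
    active_mask: Tuple[bool, ...]


def subdivide(split: WarpSplit, goes_first: Sequence[bool],
              pc_first: int, pc_second: int) -> List[WarpSplit]:
    """Split the active lanes into two groups with separate PCs.

    goes_first[i] is True when lane i belongs to the first group, for
    example lanes whose load hit in the cache, or lanes that took a branch.
    Both groups keep the same warp_id, so they index the same registers.
    """
    first = tuple(a and g for a, g in zip(split.active_mask, goes_first))
    second = tuple(a and not g for a, g in zip(split.active_mask, goes_first))
    result = []
    if any(first):
        result.append(WarpSplit(split.warp_id, pc_first, first))
    if any(second):
        result.append(WarpSplit(split.warp_id, pc_second, second))
    return result


def register_file_entries(num_warps: int, simd_width: int,
                          regs_per_thread: int) -> int:
    """Register entries scale with multi-threading depth x SIMD width."""
    return num_warps * simd_width * regs_per_thread


if __name__ == "__main__":
    warp = WarpSplit(warp_id=0, pc=100, active_mask=(True,) * SIMD_WIDTH)
    # Assume lanes 0-3 hit in the cache and lanes 4-7 miss.
    hits = [lane < 4 for lane in range(SIMD_WIDTH)]
    # Hitting lanes run ahead to the next PC; missing lanes wait at the load.
    for part in subdivide(warp, hits, pc_first=104, pc_second=100):
        print(part)
```

In this model the two splits cover disjoint lanes of the same warp, so the scheduler sees two entities while `register_file_entries` is unchanged. Adding more warps to hide the same latency would instead raise the register count in proportion to the multi-threading depth.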
