Breaking the bandwidth wall in chip multiprocessors


Abstract

In throughput-aware CMPs like GPUs and DSPs, software-managed streaming memory systems are an effective way to tolerate high latencies. E.g., the Cell/B.E. incorporates local memories, and data transfers to/from those memories are overlapped with computation using DMAs. In such designs, the latency of the memory system has little impact on performance; instead, memory bandwidth becomes critical. With the increase in the number of cores, conventional DRAMs no longer suffice to satisfy the bandwidth demand. Hence, recent throughput-aware CMPs adopted caches to filter off-chip traffic. However, such caches are optimized for latency, not bandwidth. This work presents a re-design of the memory system in throughput-aware CMPs. Instead of a traditional latency-aware cache, we propose to spread the address space using fine-grained interleaving all over a shared non-coherent last-level cache (LLC). In this way, on-chip storage is optimally used, with no need to keep coherency. On the memory side, we also propose the use of interleaving across DRAMs but with a much finer granularity than usual page-size approaches. Our proposal is highly optimized for bandwidth, not latency, by avoiding data replication in the LLC and by using fine-grained address space interleaving in both the LLC and the memory. For a CMP with 128 cores and 64-MB LLC, performance is improved by 21% due to the LLC optimizations and an extra 42% due to the off-chip memory optimizations, for a total 1.7 times performance improvement.
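The fine-grained interleaving idea can be illustrated with a minimal sketch. The abstract gives no concrete mapping function, bank count, or interleaving granularity, so everything below (the modulo mapping, `bank_of`, `banks_touched`, 16 banks, 64-byte lines, a 4 KiB stream) is an assumption chosen for illustration, not the authors' design. It only shows why a sequential DMA-style stream spreads across all banks at cache-line granularity but lands on a single bank at page granularity.

```python
from collections import Counter


def bank_of(addr: int, num_banks: int, granularity: int) -> int:
    """Map a byte address to a bank by interleaving every `granularity` bytes.

    Hypothetical modulo mapping; the paper's actual function is not given.
    """
    return (addr // granularity) % num_banks


def banks_touched(start: int, length: int, num_banks: int,
                  granularity: int, line: int = 64) -> Counter:
    """Count line-sized accesses per bank for a sequential stream."""
    return Counter(
        bank_of(addr, num_banks, granularity)
        for addr in range(start, start + length, line)
    )


# Assumed parameters: 16 banks, 64-byte lines, one 4 KiB sequential transfer.
fine = banks_touched(0, 4096, num_banks=16, granularity=64)      # line-sized interleaving
coarse = banks_touched(0, 4096, num_banks=16, granularity=4096)  # page-sized interleaving

print("fine-grained  :", dict(sorted(fine.items())))
print("page-granular :", dict(sorted(coarse.items())))
```

With these assumed parameters, the fine-grained mapping sends 4 line accesses to each of the 16 banks, while the page-sized mapping sends all 64 accesses to bank 0. The same reasoning applies whether the "banks" are LLC slices or DRAM channels: a finer interleaving lets one stream draw on the aggregate bandwidth of all of them.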