In throughput-aware CMPs such as GPUs and DSPs, software-managed streaming memory systems are an effective way to tolerate high latencies. For example, the Cell/B.E. incorporates local memories, and data transfers to and from those memories are overlapped with computation using DMAs. In such designs, the latency of the memory system has little impact on performance; instead, memory bandwidth becomes critical. As the number of cores grows, conventional DRAMs no longer suffice to satisfy the bandwidth demand. Hence, recent throughput-aware CMPs have adopted caches to filter off-chip traffic. However, such caches are optimized for latency, not bandwidth. This work presents a re-design of the memory system in throughput-aware CMPs. Instead of a traditional latency-aware cache, we propose to spread the address space across a shared non-coherent last-level cache (LLC) using fine-grained interleaving. In this way, on-chip storage is used optimally, with no need to maintain coherence. On the memory side, we also propose interleaving across DRAMs, but at a much finer granularity than the usual page-sized approaches. Our proposal is highly optimized for bandwidth, not latency: it avoids data replication in the LLC and uses fine-grained address space interleaving in both the LLC and the memory. For a CMP with 128 cores and a 64-MB LLC, performance improves by 21% due to the LLC optimizations and an extra 42% due to the off-chip memory optimizations, for a total performance improvement of 1.7×.