IEEE International Symposium on High Performance Computer Architecture

Stream Floating: Enabling Proactive and Decentralized Cache Optimizations



Abstract

As multicore systems continue to grow in scale and on-chip memory capacity, on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol, and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core as short, cache-line-granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory accesses can be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines that process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA, and subsequent exposure to the caches. We evaluate with a cycle-level, execution-driven gem5-based simulator, using 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedups over an in-order and an out-of-order (OOO) core with state-of-the-art prefetcher designs, respectively, along with 64% and 49% energy-efficiency advantages.
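To make the access patterns in the abstract concrete, below is a minimal C sketch of a kernel with an affine index stream driving a chained indirect access, the kind of pattern that stream floating could execute from within the last-level cache rather than from the core. The helpers stream_config_affine and stream_load_i32 are hypothetical stand-ins for stream-ISA instructions, not the paper's actual interface; in hardware, a stream engine (possibly floated to an LLC bank) would generate these addresses instead of the core.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical software model of a decoupled stream; in a stream ISA the
       core would only configure the stream and consume its elements.        */
    typedef struct {
        const int32_t *base;   /* start address                              */
        size_t         len;    /* trip count, known when the stream is set up */
        size_t         pos;    /* current position                           */
    } stream_t;

    static stream_t stream_config_affine(const int32_t *base, size_t len) {
        /* Would be a single stream-config instruction; the stream engine
           then generates every address on its own.                          */
        return (stream_t){ base, len, 0 };
    }

    static int32_t stream_load_i32(stream_t *s) {
        /* Next element of the stream, with no per-access address
           computation issued by the core.                                   */
        return s->base[s->pos++];
    }

    /* Indexed sum over b[idx[i]]: the affine idx stream feeds a chained,
       indirect access to b, which the abstract notes can be performed by
       chaining requests from within the cache.                              */
    static int64_t indexed_sum(const int32_t *idx, const int32_t *b, size_t n) {
        stream_t is = stream_config_affine(idx, n);
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            int32_t j = stream_load_i32(&is);   /* affine stream: idx[i]     */
            sum += b[j];                        /* indirect access: b[idx[i]] */
        }
        return sum;
    }

    int main(void) {
        int32_t idx[4] = { 3, 1, 0, 2 };
        int32_t b[4]   = { 10, 20, 30, 40 };
        printf("%lld\n", (long long)indexed_sum(idx, b, 4));  /* prints 100 */
        return 0;
    }

In this sketch the core still performs the loads; the point of stream floating, as described above, is that once both streams are exposed in the ISA, their address generation and data movement can migrate to the LLC banks so that data is pushed toward the core proactively.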
