
Design of a Snoop Filter for Snoop Based Cache Coherency Protocols

Abstract

Multi-core architectures have become common in mobile SoCs, not only for CPUs but also for mobile GPUs. With the introduction of OpenCL for mobile GPU architectures, SoCs can become more powerful than before, because programs that previously executed on the CPU can now be executed faster on the GPU. Along with this comes the need for cache coherence protocols. Snoop-based cache coherence protocols inherently lead to extensive coherence traffic on the bus in a multi-core system, and all this traffic leads to tag lookups in remote data caches. However, recent research shows that these lookups, and the coherency traffic behind them, are to a large extent unnecessary. In other words, a lot of power is wasted by transmitting unnecessary snoop requests over an interconnect. This project has explored one possible solution for reducing these requests: snoop filters.

Previous research has been done for CPUs with SPLASH and other benchmark suites. This thesis, however, looks at coherence transactions and lookups from a GPU perspective. To thoroughly analyze coherence transactions from OpenCL benchmarks, a parameterizable multi-core model has been constructed. The model is capable of replaying OpenCL benchmarks after they have executed on an ARM Mali T6xx GPU. The results show that, similarly to CPU benchmarks, the coherency traffic induced by OpenCL benchmarks also ends up in cache misses. Recent research also shows that CPU coherency protocols using the MESI states, instead of just MSI, reduce the unwanted coherency traffic. The reduction is so large that snoop filters and other coherence-limiting approaches were unnecessary. The research done for this thesis has shown that this is not the case for GPUs, as the MESI protocol does not reduce power consumption in a multi-core GPU. Because of this, snoop filters based on the CSR filter (Ranganathan 2012) were explored.
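To make the notion of an unnecessary tag lookup concrete, the following is a minimal sketch of broadcast snooping, not the thesis's multi-core model. It assumes read-only accesses, caches that never evict, and a 64-byte line size; the names `Cache` and `replay` are hypothetical. Every local miss is broadcast to all other caches, and a remote lookup that finds nothing is counted as useless.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Toy cache: a set of resident line numbers (assumption: no evictions)."""
    lines: set = field(default_factory=set)
    tag_lookups: int = 0       # snoop-induced tag lookups seen by this cache
    useless_lookups: int = 0   # lookups that found no matching line

    def snoop(self, line):
        self.tag_lookups += 1
        hit = line in self.lines
        if not hit:
            self.useless_lookups += 1
        return hit


def replay(trace, num_cores, line_bytes=64):
    """Replay (core, byte_address) read accesses with broadcast snooping."""
    caches = [Cache() for _ in range(num_cores)]
    for core, addr in trace:
        line = addr // line_bytes
        if line in caches[core].lines:
            continue  # local hit: nothing goes out on the bus
        for other, cache in enumerate(caches):
            if other != core:
                cache.snoop(line)  # every remote cache pays a tag lookup
        caches[core].lines.add(line)
    return caches


# Hypothetical four-access trace on a four-core system.
trace = [(0, 0x000), (0, 0x040), (1, 0x1000), (1, 0x000)]
caches = replay(trace, num_cores=4)
total = sum(c.tag_lookups for c in caches)
useless = sum(c.useless_lookups for c in caches)
print(total, useless)
```

In this toy trace only the last access finds its line in a remote cache, so nearly all snoop-induced lookups are wasted, which is the situation a snoop filter is meant to address.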
The analysis in this thesis of the original destination-based CSR filter showed that the filter reduced the unnecessary tag lookups by around 53% for the OpenCL benchmarks. This indicates great underlying potential in how the filter resources are selected according to the address stream. The analysis also showed that a fair share of the snoop-induced transactions were unnecessary as well.

Based on the filter analysis, two new filters were designed:

  • Source-CSR
  • Hashed-index CSR

Although the source CSR filter represents more hardware overhead than the destination filter, it is capable of reducing the snoop transactions by 30%. The source filter is also capable of a 53% reduction of the tag lookups. The hashed-index filter was inspired by the potential for reducing the tag lookups by more than 53%, and it achieved a 56% reduction. Although this is only a 3 percentage point improvement over the normal filter, the filter performed remarkably well for a number of benchmarks. Unfortunately, this was not the case for all benchmarks. This shows that dynamic allocation of filter resources is capable of further reduction: the best case would have been to use the original resource selection on some of the benchmarks and the hashed-index system on the others.

The source filter was also implemented in Verilog HDL and formally verified in JasperGold using SystemVerilog. The filter was supposed to be power simulated, but an unknown error in the switching activity conversion halted any further power estimation. A proper conclusion about the power-saving potential of the source filter can therefore not be made. This thesis does, however, include a power estimation methodology so that the power estimation can be completed in the future.
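The sensitivity to how filter resources are selected from the address stream can be illustrated with a generic counting presence filter. This is a sketch under stated assumptions, not the CSR filter or the thesis's hashed-index design: the class `PresenceFilter`, the XOR-fold in `hashed_index`, and the 16-entry table size are all hypothetical choices made for illustration. Each entry counts the resident lines that map to it, and a snoop is passed on to the cache only if its entry is non-zero.

```python
class PresenceFilter:
    """Counting filter: an entry > 0 means a resident line may map to it."""

    def __init__(self, entries, index_fn):
        self.entries = entries
        self.index_fn = index_fn
        self.counts = [0] * entries

    def allocate(self, line):
        self.counts[self.index_fn(line, self.entries)] += 1

    def evict(self, line):
        self.counts[self.index_fn(line, self.entries)] -= 1

    def may_contain(self, line):
        # False is always safe to act on; True may be a false positive.
        return self.counts[self.index_fn(line, self.entries)] > 0


def direct_index(line, entries):
    # Select the entry from the low-order bits of the line number.
    return line % entries


def hashed_index(line, entries):
    # Hypothetical hash: XOR-fold the upper bits into the low-order bits.
    return (line ^ (line // entries)) % entries


def passed_snoops(index_fn, resident, snoops, entries=16):
    """Count snoops that the filter lets through to the cache."""
    filt = PresenceFilter(entries, index_fn)
    for line in resident:
        filt.allocate(line)
    return sum(1 for line in snoops if filt.may_contain(line))


# Strided pattern: resident lines and snooped lines share a stride of 16.
resident = [0, 16, 32, 48]
snoops = [64, 80, 96, 112]   # none of these lines is resident
print(passed_snoops(direct_index, resident, snoops))
print(passed_snoops(hashed_index, resident, snoops))
```

With this strided pattern, direct indexing maps every line to the same entry and lets all four useless snoops through, while the hashed index filters them all. Other address patterns can favor direct indexing instead, which is consistent with the observation that the hashed index did not help on every benchmark.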

Bibliographic details

  • Author

    Ulfsnes Rasmus;

  • Author affiliation
  • Year 2013
  • Total pages
  • Original format PDF
  • Language of text eng
  • CLC classification
