IEEE International Symposium on Circuits and Systems

A 0.42V high bandwidth synthesizable parallel access smart memory fabric for computer vision

Abstract

We present the design of a scalable multiport memory compiler, with 2 to 12 ports, simultaneous read-port access, and closely packed graphics integration capability, designed for low-power, high-bandwidth, low-latency stream vector processors and machine learning applications. A novel pipelined decoder and bitline repeater insertion help achieve a fast cycle time. Memory words can be accessed in serial, parallel, or mixed fashion. A wide supply range of 0.4 V to 1.1 V is supported without any complex write- or read-assist circuit. The design is non-self-timed and fully testable, and its timing and power views are generated through a static timing analysis (STA) approach. The layout combines automatic place and route of standard cells in the periphery with a full-custom, standard-cell-compatible, high-density memory core. The full-custom core is tightly coupled with common graphics processing operations to enable low-latency (< 1 μs), high-bandwidth operation at low voltage. This hybrid approach reduces the turnaround time to just a few man-weeks. The area penalty of a 2W2R 64 Kbit instance is up to 10% compared with a logic-rule-based, full-custom, high-speed 1W1R compiler, while throughput is doubled. Compared with a completely RTL-based synthesis approach, the area is just 5% for 64 Kbit. A 2W2R 32×128 test-chip instance in a sub-20 nm FinFET process runs at up to 3 GHz in CAD at a 1.1 V supply and -40 °C. The measured speed of the same instance on silicon is 86 MHz (at 0.42 V) for simultaneous access from both ports, and the energy consumed is just 5 pJ/cycle in the typical process corner. The architecture is scalable up to 64 KB for more parallel architectures (64 cores), as demanded in ultra-high-definition real-time computational photography [1].
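
As a rough sanity check on the measured figures, the sketch below converts energy per cycle into average power. It assumes that the 5 pJ/cycle figure applies at the 86 MHz, 0.42 V operating point, which the abstract suggests but does not state outright; the function and constant names are illustrative only.

```python
def average_power_watts(energy_per_cycle_j: float, clock_hz: float) -> float:
    """Average power as energy per cycle times cycles per second."""
    return energy_per_cycle_j * clock_hz


# Figures quoted in the abstract; pairing them is an assumption (see lead-in).
ENERGY_PER_CYCLE_J = 5e-12   # 5 pJ/cycle, typical process corner
CLOCK_HZ = 86e6              # 86 MHz measured on silicon at 0.42 V

power_w = average_power_watts(ENERGY_PER_CYCLE_J, CLOCK_HZ)
print(f"{power_w * 1e3:.2f} mW")  # prints "0.43 mW"
```

Under that assumption, the instance would draw on the order of 0.43 mW while both ports are accessed simultaneously.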