Journal: International Journal of Parallel Programming

Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs


Abstract

Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, a technique that enables software-configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multiword transfer initiation, completion notifications for software-selected sets of arbitrary-size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measured the logic overhead of basic NI functionality over a cache-only design to be less than 20%. We also evaluate on-chip communication performance on the prototype, and the performance of synchronization functions through simulation of CMPs with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
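As a rough illustration of the second mechanism named in the abstract (completion notification for a software-selected set of arbitrary-size transfers), the following is a minimal software model, not the paper's hardware interface. The class `CompletionCounter`, its `acknowledge` method, and the byte-counting scheme are all hypothetical names and assumptions introduced here; the abstract does not specify how the NI tracks completion.

```python
class CompletionCounter:
    """Toy software model of a completion counter.

    Hypothetical: the abstract only says that software can request a
    notification for a chosen set of arbitrary-size transfers. Counting
    outstanding bytes is an assumption made for this sketch.
    """

    def __init__(self, expected_bytes, on_complete):
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes   # bytes still outstanding
        self.on_complete = on_complete    # callback standing in for the notification
        self.fired = False

    def acknowledge(self, nbytes):
        """Record that nbytes of the tracked transfers have arrived."""
        if self.fired:
            raise RuntimeError("counter already completed")
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("acknowledgement does not match outstanding bytes")
        self.remaining -= nbytes
        if self.remaining == 0:
            # All selected transfers are done: deliver one notification.
            self.fired = True
            self.on_complete()


def run_demo():
    """Track three transfers of different sizes that complete out of order."""
    notifications = []
    transfers = [64, 256, 8]  # sizes in bytes, chosen arbitrarily
    counter = CompletionCounter(sum(transfers),
                                lambda: notifications.append("done"))
    for size in reversed(transfers):  # arrival order differs from issue order
        counter.acknowledge(size)
    return notifications, counter.remaining
```

The point of the sketch is that a single notification covers the whole set, regardless of how many transfers it contains, how large each one is, or the order in which they complete.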
