Conference: IEEE/ACM International Symposium on Networks-on-Chip

Single-cycle collective communication over a shared network fabric


Abstract

In the multicore era, on-chip network latency and throughput have a direct impact on system performance. A highly important class of communication flows traversing the network is collective, i.e., one-to-many and many-to-one. Scalable coherence protocols often leverage imprecise tracking to lower the overhead of directory storage, in turn leading to more collective communications on-chip. Routers with support for message forking/aggregation have been previously demonstrated, supporting such protocols. However, even with the fastest possible designs today (1-cycle routers), collective flows on a k×k mesh still incur delays proportional to k since all communication is across the entire chip. As k increases across technology generations, the latency of these flows will also go up. Yet the pure wire delay to cross the chip is just 1-2 cycles today, and is expected to remain roughly invariant. The dependence of message delays on k arises from the requirement to latch messages at every router. In this work, we remove this requirement. We design a network fabric that enables messages to (1) dynamically create virtual 1-to-Many (multicast) and Many-to-1 (reduction) tree routes over a physical mesh, (2) get forked/aggregated at nodes on the tree, and (3) traverse the tree, all within a single cycle across each dimension. For synthetic 1-to-Many/Many-to-1 flows, we demonstrate a 76%/82% reduction in latency and a 1.6×/2× improvement in throughput over a state-of-the-art NoC with 1-cycle routers and support for collective communication. Across a suite of SPLASH-2 and PARSEC benchmarks, full-system runtime and energy are reduced by 14% and 50%, respectively, for a limited-directory protocol.
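
The scaling argument in the abstract can be illustrated with a back-of-the-envelope model. The sketch below is not the paper's evaluation methodology. It assumes contention-free traffic and dimension-ordered tree routing, charges one cycle per hop when messages are latched at every router, and charges one cycle per traversed dimension in the idealized single-cycle case. The function names and the cost model are hypothetical, and the output is not meant to reproduce the reported latency reductions.

```python
def farthest_offsets(k, src):
    """Largest X and Y distances from src to any node of a k x k mesh."""
    sx, sy = src
    return max(sx, k - 1 - sx), max(sy, k - 1 - sy)


def per_hop_latency(k, src, cycles_per_hop=1):
    """Worst-case cycles for a 1-to-Many flow when every router latches
    the message (assumed cost model: cycles_per_hop for each hop)."""
    max_dx, max_dy = farthest_offsets(k, src)
    return cycles_per_hop * (max_dx + max_dy)


def per_dimension_latency(k, src):
    """Idealized worst-case cycles when a whole dimension is crossed in
    one cycle (assumption: no contention, no setup overhead)."""
    max_dx, max_dy = farthest_offsets(k, src)
    return (1 if max_dx > 0 else 0) + (1 if max_dy > 0 else 0)


if __name__ == "__main__":
    # Corner source: the per-hop cost grows with k, the per-dimension cost does not.
    for k in (4, 8, 16):
        print(k, per_hop_latency(k, (0, 0)), per_dimension_latency(k, (0, 0)))
```

Under these assumptions the per-hop cost grows linearly with k while the per-dimension cost stays at two cycles or fewer, which is the dependence on k that the abstract describes removing.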
