Journal: IEEE Transactions on Parallel and Distributed Systems

Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters


Abstract

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance on the order of GOPS and an energy efficiency of over 100 GOPS/W. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a lightweight hardware-accelerated synchronization and communication unit (SCU) for tightly coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy efficiency with tunable microbenchmarks and a set of real-life applications and kernels. When the microbenchmarks are constrained to 10 percent synchronization overhead, the proposed solution allows synchronization-free regions as small as 42 cycles, over 41x smaller than the baseline implementation based on fast test-and-set access to L1 memory. When evaluated on the real-life DSP applications, the proposed SCU improves performance by up to 92 percent (23 percent on average) and energy efficiency by up to 98 percent (39 percent on average).
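To make the software baseline mentioned in the abstract more concrete, the following is a minimal sketch of a barrier built on a test-and-set lock. It is not the authors' code: the names tas_barrier_t, tas_barrier_init, tas_lock, tas_unlock and tas_barrier_wait are hypothetical, and C11 atomics on an ordinary host stand in for the cluster's test-and-set access to L1 memory. The barrier is a standard sense-reversing design, chosen here only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch, NOT the paper's implementation: a sense-reversing
 * barrier whose arrival counter is guarded by a test-and-set spinlock. */
typedef struct {
    atomic_flag lock;   /* test-and-set lock protecting `arrived` */
    int arrived;        /* cores that have reached the barrier this round */
    int num_cores;      /* number of participating cores */
    atomic_bool sense;  /* flips once per completed barrier round */
} tas_barrier_t;

static void tas_barrier_init(tas_barrier_t *b, int num_cores) {
    atomic_flag_clear(&b->lock);
    b->arrived = 0;
    b->num_cores = num_cores;
    atomic_init(&b->sense, false);
}

static void tas_lock(atomic_flag *f) {
    /* Busy-wait: the core stays active and keeps polling shared memory. */
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire)) {
    }
}

static void tas_unlock(atomic_flag *f) {
    atomic_flag_clear_explicit(f, memory_order_release);
}

/* Each core owns a `local_sense` variable, initialised to false. */
static void tas_barrier_wait(tas_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;

    tas_lock(&b->lock);
    bool last = (++b->arrived == b->num_cores);
    if (last) {
        b->arrived = 0;  /* reset for the next round before releasing */
    }
    tas_unlock(&b->lock);

    if (last) {
        /* The last arrival releases all waiting cores. */
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* Everyone else spins until the shared sense matches. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != *local_sense) {
        }
    }
}
```

In this sketch every core spins both on the lock and on the shared sense flag, so waiting cores remain active for the whole wait. The abstract describes the SCU as hardware-accelerated and as enabling fine-grain per-PE power management, which is the contrast this busy-waiting scheme is meant to illustrate.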
