High-Performance RMA-Based Broadcast on the Intel SCC

Abstract

Many-core chips with more than 1000 cores are expected by the end of the decade. To overcome scalability issues related to cache coherence at such a scale, one of the main research directions is to leverage the message-passing programming model. The Intel Single-Chip Cloud Computer (SCC) is a prototype of a message-passing many-core chip. It offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). Performance of message-passing applications is directly affected by efficiency of collective operations, such as broadcast. In this paper, we study how to make use of the MPBs to implement an efficient broadcast algorithm for the SCC. We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Using a LogP-based model, we present an analytical evaluation that compares our algorithm to the state-of-the-art broadcast algorithms implemented for the SCC. As predicted by the model, experimental results show that OC-Bcast attains almost three times better throughput, and improves latency by at least 27%. Furthermore, the analytical evaluation highlights the benefits of our approach: OC-Bcast takes direct advantage of RMA, unlike the other considered broadcast algorithms, which are based on a higher-level send/receive interface. This leads us to the conclusion that RMA-based collective operations are needed to take full advantage of hardware features of future message-passing many-core architectures.
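
The idea of a pipelined k-ary broadcast tree can be illustrated with a small sketch. This is not the OC-Bcast implementation: the abstract does not specify the tree layout or the cost model, so the heap-style rank numbering, the function names, and the "one chunk per step per child" timing below are all assumptions made for illustration. In particular, the sketch treats the k children of a node as fetching a chunk in parallel at no extra cost, which a real LogP-style analysis would not do.

```python
def children(rank, k, n):
    """Children of `rank` in a k-ary tree over ranks 0..n-1 (assumed heap-style layout)."""
    first = k * rank + 1
    return [c for c in range(first, first + k) if c < n]


def parent(rank, k):
    """Parent of `rank` in the same assumed layout; the root (rank 0) has none."""
    return None if rank == 0 else (rank - 1) // k


def depth(n, k):
    """Number of tree levels below the root, found by walking up from the last rank."""
    d = 0
    rank = n - 1
    while rank != 0:
        rank = parent(rank, k)
        d += 1
    return d


def pipelined_steps(n, k, chunks):
    """Count steps until every rank holds all chunks of the message.

    Simplified model (an assumption, not the paper's): in each step, every
    rank whose parent already holds more chunks than it does obtains exactly
    one more chunk, and all children of a node do so in parallel.
    """
    have = [0] * n          # have[r] = chunks that rank r holds so far
    have[0] = chunks        # the root starts with the whole message
    steps = 0
    while min(have) < chunks:
        snapshot = list(have)   # state at the start of the step
        for r in range(1, n):
            if snapshot[parent(r, k)] > snapshot[r]:
                have[r] += 1
        steps += 1
    return steps


if __name__ == "__main__":
    n = 48  # the SCC has 48 cores
    for k in (2, 7):
        print(k, depth(n, k), pipelined_steps(n, k, chunks=4))
```

Under this toy model the step count equals the tree depth plus the number of chunks minus one, so a wider tree (larger k) shortens the pipeline fill time; whether that wins in practice depends on the contention costs the sketch ignores.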