
Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor


Abstract

To amortize the cost of MPI collective operations, nonblocking collectives have been proposed so that communication can be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores yields no overlap. In this article, we propose placement algorithms for progress threads that achieve communication/computation overlap on cores dedicated to communication without degrading performance. We first show that even simple collective operations, such as those based on a chain topology, are not straightforward to progress in the background on a dedicated core. Then, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores and the long but narrow part of the tree on one or several communication cores, giving a trade-off between overlap and absolute performance. We provide a model to study and predict the algorithm's behavior and to tune its parameters. We implemented both algorithms in the Multi-Processor Computing (MPC) framework, a thread-based MPI implementation. We ran benchmarks on manycore processors such as KNL and Skylake and obtained good results for both performance and overlap.
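
The idea of splitting a collective's tree between application cores and communication cores can be illustrated with a small, self-contained simulation. The sketch below is not the authors' algorithm: the abstract does not state the split criterion, so the fan-out threshold used here is purely an illustrative assumption, and the names `binomial_tree` and `split_by_fanout` are hypothetical. It builds a binomial broadcast tree and assigns ranks with many children (taken here as the "heavy" part) to application cores, and the remaining forwarding ranks (taken here as the "narrow" part) to communication cores.

```python
def binomial_tree(n):
    """Return {rank: [children]} for a binomial broadcast tree rooted at rank 0."""
    children = {rank: [] for rank in range(n)}
    for rank in range(1, n):
        # Clearing the lowest set bit of a rank gives its parent.
        parent = rank & (rank - 1)
        children[parent].append(rank)
    return children


def split_by_fanout(children, threshold):
    """Partition forwarding ranks by fan-out (assumed criterion, not from the paper).

    Ranks that send to at least `threshold` children go to the application-core
    part; ranks with fewer (but at least one) children go to the
    communication-core part. Leaves forward nothing and are left out.
    """
    app = sorted(r for r, kids in children.items() if len(kids) >= threshold)
    comm = sorted(r for r, kids in children.items() if 0 < len(kids) < threshold)
    return app, comm


if __name__ == "__main__":
    tree = binomial_tree(8)
    app_part, comm_part = split_by_fanout(tree, threshold=2)
    print("application cores:", app_part)
    print("communication cores:", comm_part)
```

With 8 ranks and a threshold of 2, ranks 0 and 4 fall into the application-core part and ranks 2 and 6 into the communication-core part. In the article, the threshold-like parameters are tuned with the authors' performance model, which this sketch does not reproduce.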