Journal: Parallel Computing

Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Abstract

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance, especially in hybrid systems that use accelerators. Processor cores on the same socket can communicate at lower latency and higher bandwidth than cores on different sockets, whether within the same node or across nodes. A key challenge is to use this communication hierarchy efficiently and thereby optimize performance. We consider here the class of applications that contain wave-front processing. In these applications, data can be processed only after their upstream neighbors have been processed. Similar dependencies arise between processors: communication is required to pass boundary data downstream, and its cost is typically affected by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of the slower communications in the hierarchy, at the cost of additional steps in the parallel computation and greater use of on-chip communication. This tradeoff is explored using a performance model. An implementation using the reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems in which a differential in communication performance exists.
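
The upstream dependency described in the abstract can be illustrated with a minimal serial sketch. This is not the authors' implementation or their hierarchical scheme: the function names `wavefront_order` and `sweep` and the update rule (each cell adds its upper and left neighbors) are assumptions chosen only to show why cells on the same anti-diagonal are independent of one another while successive diagonals must be processed in order.

```python
def wavefront_order(rows, cols):
    """Group the cells of a rows x cols grid into anti-diagonal wave-fronts.

    Every cell in one front depends only on cells in earlier fronts,
    so the cells within a front could be processed in parallel.
    """
    fronts = []
    for d in range(rows + cols - 1):
        lo = max(0, d - cols + 1)
        hi = min(rows, d + 1)
        fronts.append([(i, d - i) for i in range(lo, hi)])
    return fronts


def sweep(grid):
    """Sweep the grid front by front (hypothetical update rule).

    Each cell is computed only after its upstream neighbors
    (the cell above and the cell to the left) are done.
    """
    rows, cols = len(grid), len(grid[0])
    result = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        for i, j in front:
            up = result[i - 1][j] if i > 0 else 0
            left = result[i][j - 1] if j > 0 else 0
            result[i][j] = grid[i][j] + up + left
    return result
```

A `rows` x `cols` grid needs `rows + cols - 1` fronts, which is the pipeline length that a parallel decomposition has to trade off against communication cost.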
