Journal: Cluster Computing

Scalable PGAS collective operations in NUMA clusters



Abstract

The increasing number of cores per processor is making manycore-based systems pervasive. This involves dealing with multiple levels of memory in non-uniform memory access (NUMA) systems and with hierarchies of processor cores, accessible via complex interconnects, in order to dispatch the increasing amount of data required by the processing elements. The key to efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes more important in these systems, because it avoids the unnecessary synchronization between pairs of processes that occurs in collective operations implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that provide good performance and scalability in collective operations, based on the use of hierarchical trees, overlapping one-sided communications, message pipelining and the available NUMA binding features. An implementation has been developed for Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language, which presents a shared memory view across the nodes for programmability, while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), has shown generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.
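To make the idea of a hierarchical tree concrete, the following is a minimal Python sketch of a two-level broadcast schedule. It is not the algorithm from the paper, whose details the abstract does not give. It assumes that threads are numbered consecutively within each node, that the first thread of every node acts as a hypothetical "leader", that leaders are connected by a binomial tree, and that each leader forwards data directly to the other threads of its node. The function names `binomial_parent`, `hierarchical_parents` and `inter_node_messages` are illustrative only.

```python
def binomial_parent(rank):
    """Parent of `rank` in a binomial tree rooted at 0 (None for the root).

    Assumption for illustration: the parent is obtained by clearing the
    highest set bit of the rank.
    """
    if rank == 0:
        return None
    return rank & ~(1 << (rank.bit_length() - 1))


def hierarchical_parents(num_threads, threads_per_node):
    """Map every thread to the thread it receives the broadcast data from.

    Level 1: node leaders (first thread of each node) form a binomial tree.
    Level 2: each leader serves the remaining threads of its own node.
    """
    parents = {}
    for thread in range(num_threads):
        node, local = divmod(thread, threads_per_node)
        if local != 0:
            # Non-leader thread: data comes from the leader of the same node.
            parents[thread] = node * threads_per_node
        elif node == 0:
            # Leader of node 0 is the root of the broadcast.
            parents[thread] = None
        else:
            # Leader of another node: data comes from a leader further up the tree.
            parents[thread] = binomial_parent(node) * threads_per_node
    return parents


def inter_node_messages(parents, threads_per_node):
    """Count the transfers whose source and destination are on different nodes."""
    return sum(
        1
        for child, parent in parents.items()
        if parent is not None
        and child // threads_per_node != parent // threads_per_node
    )


if __name__ == "__main__":
    tree = hierarchical_parents(num_threads=8, threads_per_node=2)
    for child, parent in sorted(tree.items()):
        print(child, "<-", parent)
    print("inter-node transfers:", inter_node_messages(tree, 2))
```

Under these assumptions, only one transfer per non-root node crosses the interconnect, and all remaining traffic stays inside a node. This is the kind of bottleneck reduction the abstract attributes to hierarchical trees. Overlapping, pipelining and NUMA binding are not modeled here.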
