Annals of Nuclear Energy

Implementation of the CPU/GPU hybrid parallel method of characteristics neutron transport calculation using the heterogeneous cluster with dynamic workload assignment



Abstract

In recent years, graphics processing units (GPUs) have been adopted in many High-Performance Computing (HPC) systems due to their massive computational power and superior energy efficiency. Accelerating CPU-only computational codes on heterogeneous clusters equipped with multi-core CPUs and GPUs has therefore attracted considerable attention. One focus of heterogeneous computing is to efficiently exploit all computational resources available on a cluster, both CPUs and GPUs. In this paper, a heterogeneous MPI + OpenMP/CUDA parallel algorithm for solving the 2D neutron transport equation with the method of characteristics (MOC) is implemented. In this algorithm, the spatial domain decomposition technique provides the coarse-grained parallelism through the MPI protocol, while the fine-grained parallelism is exploited through OpenMP (in the CPU-calculated domains) and CUDA (in the GPU-calculated domains) based on ray parallelization. In order to efficiently leverage the computing power of heterogeneous clusters, a dynamic workload assignment scheme is proposed, which distributes the workload based on the runtime performance of the CPUs and GPUs in the cluster. Moreover, the strong-scaling performance of the MPI + CUDA parallelization is studied through a performance analysis model that quantifies the impact of the degradation of the iteration scheme, the load imbalance issue, the data copies between CPUs and GPUs, and the MPI communication in the MPI + CUDA parallel algorithm; the corresponding conclusions remain valid for the MPI + OpenMP/CUDA parallelization. The C5G7 2D benchmark and an extended 2D whole-core problem are calculated with the MPI + CUDA, MPI + OpenMP/CUDA, and MPI parallelizations for comparison. Numerical results demonstrate that the heterogeneous parallel algorithm maintains the desired accuracy, and that the dynamic workload assignment scheme provides an optimal workload assignment that closely matches the experimental results. In addition, an improvement of over 11% is observed for the MPI + OpenMP/CUDA parallelization compared with the MPI + CUDA parallelization. Moreover, the CPU/GPU heterogeneous clusters significantly outperform the CPU-only clusters, and one heterogeneous node is roughly five times faster than a CPU-only node. (C) 2019 Elsevier Ltd. All rights reserved.
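
The dynamic workload assignment scheme lends itself to a compact illustration. The C++ sketch below is a hypothetical rebalancing step, not the authors' implementation: it assumes each processing domain (a group of CPU threads or a GPU) reports the wall time of its last transport sweep, and the characteristic tracks are then redistributed in proportion to the measured throughputs so that all domains ideally finish a sweep together. The struct and function names are illustrative assumptions.

```cpp
// workload_split.cpp -- hypothetical sketch of runtime-based workload assignment.
// Compile: g++ -std=c++17 workload_split.cpp -o workload_split
#include <cstdio>
#include <vector>

// One entry per computational resource (a multi-core CPU or a GPU) on the node/cluster.
struct Device {
    const char* name;
    long   tracks_assigned;   // characteristic tracks swept in the last iteration
    double sweep_seconds;     // measured wall time of that sweep
    double throughput() const { return tracks_assigned / sweep_seconds; }
};

// Redistribute the total number of tracks in proportion to each device's
// measured throughput, so that every device finishes its sweep at about the same time.
void rebalance(std::vector<Device>& devices, long total_tracks) {
    double total_rate = 0.0;
    for (const Device& d : devices) total_rate += d.throughput();
    long assigned = 0;
    for (size_t i = 0; i < devices.size(); ++i) {
        if (i + 1 == devices.size()) {
            devices[i].tracks_assigned = total_tracks - assigned;  // absorb rounding
        } else {
            devices[i].tracks_assigned =
                static_cast<long>(total_tracks * devices[i].throughput() / total_rate);
            assigned += devices[i].tracks_assigned;
        }
    }
}

int main() {
    // Made-up calibration timings: the GPU processed the same share of tracks
    // about five times faster than the 12 CPU threads combined.
    std::vector<Device> devices = {
        {"CPU (OpenMP, 12 threads)", 100000, 2.50},
        {"GPU (CUDA)",               100000, 0.50},
    };
    const long total_tracks = 200000;
    rebalance(devices, total_tracks);
    for (const Device& d : devices)
        std::printf("%-26s -> %ld tracks\n", d.name, d.tracks_assigned);
    return 0;
}
```

With these made-up timings, the 200,000 tracks are split roughly 33,000 to the CPU and 167,000 to the GPU; in the actual algorithm this split would be recomputed from the runtime performance observed on the cluster.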
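The abstract does not reproduce the performance analysis model itself. As a minimal sketch, assuming (our assumption, not the paper's stated form) that the per-iteration time decomposes additively into a sweep term, a CPU-GPU copy term, and an MPI communication term, the strong-scaling speedup on $P$ spatial subdomains could be written as

$$
S(P) = \frac{N_{\text{iter}}(1)\,T_{\text{iter}}(1)}{N_{\text{iter}}(P)\,T_{\text{iter}}(P)},
\qquad
T_{\text{iter}}(P) \approx \max_{p \le P}\bigl(T_{\text{sweep},p} + T_{\text{copy},p}\bigr) + T_{\text{MPI}}(P),
$$

where the ratio $N_{\text{iter}}(P)/N_{\text{iter}}(1) \ge 1$ captures the degradation of the iteration scheme under spatial domain decomposition, the maximum over subdomains $p$ captures the load imbalance, $T_{\text{copy},p}$ the data copies between CPUs and GPUs, and $T_{\text{MPI}}(P)$ the MPI communication cost.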
