Computer Physics Communications

Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

Abstract

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism provided by the data-parallel devices are described in terms of data layout, data flow, and data-parallel instructions. Optimized Cell and GPU performance is compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell, and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors, or two GPUs in a shared memory configuration (without MPI). Finally, we compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured-grid-based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
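
To make the algorithm class concrete, the sketch below shows one explicit time step for a model hyperbolic problem, the 1-D linear advection equation u_t + a u_x = 0, discretized with the Lax-Friedrichs scheme on a structured grid and written as a CUDA kernel: one thread per cell, a nearest-neighbour stencil, and a buffer swap per step. This is a minimal, hypothetical illustration of the technique the abstract describes, not code from the paper; all identifiers (advect_step, step, u_old, u_new) are illustrative.

    #include <cuda_runtime.h>

    // Minimal sketch (not from the paper): one explicit Lax-Friedrichs
    // step for u_t + a*u_x = 0. Each thread updates one interior cell
    // from its two neighbours -- the fine, data-parallel level.
    __global__ void advect_step(const float *u_old, float *u_new,
                                int n, float a, float dt, float dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 1 && i < n - 1) {
            u_new[i] = 0.5f * (u_old[i - 1] + u_old[i + 1])
                     - a * dt / (2.0f * dx) * (u_old[i + 1] - u_old[i - 1]);
        }
    }

    // Host-side driver: one kernel launch per time step; the caller
    // swaps d_u_old and d_u_new afterwards. Error checking omitted.
    void step(float *d_u_old, float *d_u_new,
              int n, float a, float dt, float dx)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        advect_step<<<blocks, threads>>>(d_u_old, d_u_new, n, a, dt, dx);
    }

The explicit update is stable under the usual CFL condition |a| dt / dx <= 1, and the double-precision variant the abstract compares against follows the same pattern with double in place of float.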
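At the coarsest level of parallelism mentioned in the abstract, MPI communication between nodes for such stencil codes typically reduces to a halo (ghost-cell) exchange before each explicit step. The following is a hedged sketch assuming a 1-D domain decomposition in which each rank stores n_local interior cells plus one ghost cell at each end; exchange_halos and the buffer layout are assumptions for illustration, not the paper's implementation.

    #include <mpi.h>

    // Minimal sketch, assuming a 1-D decomposition: u holds interior
    // cells in u[1..n_local], with ghost cells u[0] and u[n_local + 1].
    void exchange_halos(float *u, int n_local, int rank, int nprocs)
    {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        // Rightmost interior cell goes right; left ghost arrives from the left.
        MPI_Sendrecv(&u[n_local], 1, MPI_FLOAT, right, 0,
                     &u[0],       1, MPI_FLOAT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Leftmost interior cell goes left; right ghost arrives from the right.
        MPI_Sendrecv(&u[1],           1, MPI_FLOAT, left,  1,
                     &u[n_local + 1], 1, MPI_FLOAT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

On a GPU cluster, the ghost cells would additionally be staged between device and host memory around the exchange (or sent directly with a CUDA-aware MPI), which is the sort of data-flow consideration the abstract alludes to.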