Initial results on computational performance of Intel many integrated core, sandy bridge, and graphical processing unit architectures: implementation of a 1D c++/OpenMP electrostatic particle-in-cell code

A. Vapirev; J. Deca; G. Lapenta; S. Markidis; I. Hur; J.-L. Cambier

Abstract

We present initial comparative performance results for the Intel many integrated core (MIC), Sandy Bridge (SB), and graphical processing unit (GPU) architectures. A 1D explicit electrostatic particle-in-cell code is used to simulate a two-stream instability in plasma. We compare the computation times for various numbers of cores/threads and compiler options. The parallelization is implemented via OpenMP with a maximum thread number of 128. Parallelization and vectorization on the GPU are achieved by modifying the code syntax for compatibility with CUDA. We assess the speedup due to various auto-vectorization and optimization-level compiler options. Our results show that the MIC is several times slower than SB for a single thread, and it becomes faster than SB when the number of cores increases with vectorization switched on. The compute times for the GPU are consistently about six to seven times faster than the ones for MIC. Compared with SB, the GPU is about two times faster for a single thread and about an order of magnitude faster for 128 threads. The net speedup, however, is almost the same for MIC and GPU. An initial attempt to offload parts of the code to the MIC coprocessor shows that there is an optimal number of threads where the speedup reaches a maximum.
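The abstract does not reproduce any of the code, so the following is only a minimal sketch of what an OpenMP-parallel particle push in a 1D explicit electrostatic particle-in-cell step might look like. The `Particles` struct, the `push_particles` function, the linear field interpolation, and the periodic boundary are all assumptions made for illustration; they are not taken from the paper.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical particle container; the names are illustrative only.
struct Particles {
    std::vector<double> x;  // positions, assumed to lie in [0, length)
    std::vector<double> v;  // velocities
};

// Advance every particle by one explicit time step on a periodic 1D grid.
// E holds the electric field at the grid nodes, spaced dx = length / E.size().
// qm is the charge-to-mass ratio and dt is the time step.
void push_particles(Particles& p, const std::vector<double>& E,
                    double length, double qm, double dt) {
    assert(p.x.size() == p.v.size());
    assert(!E.empty());
    const long n = static_cast<long>(p.x.size());
    const long ng = static_cast<long>(E.size());
    const double dx = length / static_cast<double>(ng);

    // Each iteration reads the shared field and writes only particle i,
    // so the loop can be split across threads without data races.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        // Gather: linear interpolation of E between the two nearest nodes.
        const double s = p.x[i] / dx;
        const double cell = std::floor(s);
        const long j = static_cast<long>(cell) % ng;
        const double w = s - cell;
        const double Ep = (1.0 - w) * E[j] + w * E[(j + 1) % ng];

        // Accelerate first, then move with the updated velocity.
        p.v[i] += qm * Ep * dt;
        p.x[i] += p.v[i] * dt;

        // Periodic wrap back into the domain.
        p.x[i] -= length * std::floor(p.x[i] / length);
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs serially, which gives a single-thread baseline for the kind of thread-count comparison the abstract describes. The charge deposition and field solve, which a full particle-in-cell step also needs, are left out of this sketch.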

Bibliographic record

Source
Concurrency and Computation: Practice and Experience | 2015, Issue 3 | pp. 581-593 | 13 pages
Authors
A. Vapirev; J. Deca; G. Lapenta; S. Markidis; I. Hur; J.-L. Cambier
Author affiliations

Department of Mathematics, KU Leuven, Celestijnenlaan 200b bus 2400, Heverlee 3001, Belgium; Intel ExaScience Lab, Kapeldreef 75, B-3001 Leuven, Belgium;

Department of Mathematics, KU Leuven, Celestijnenlaan 200b bus 2400, Heverlee 3001, Belgium;

Department of Mathematics, KU Leuven, Celestijnenlaan 200b bus 2400, Heverlee 3001, Belgium;

PDC Centre, KTH Royal Institute of Technology, Stockholm, Sweden;

Intel ExaScience Lab, Kapeldreef 75, B-3001 Leuven, Belgium;

AFRL/PRSA, Edwards AFB, California 93524, USA;

Indexing information
Original format: PDF
Language: English
Keywords
coprocessor; many integrated cores; particle-in-cell; heterogeneous computing;

Similar literature

1. Multi-Threaded Computation of the Sobel Image Gradient on Intel Multi-Core Processors Using OpenMP Library [J]. Ahmed Sherif Zekri. International Journal of Computer Science & Information Technology (IJCSIT), 2016, Issue 2.
2. Performance of a Second Order Electrostatic Particle-in-Cell Algorithm on Modern Many-Core Architectures [J]. Dominic A.S. Brown, Steven A. Wright, Stephen A. Jarvis. Electronic Notes in Theoretical Computer Science, 2018, Issue 1.
3. Manycore challenge in particle-in-cell simulation: How to exploit 1 TFlops peak performance for simulation codes with irregular computation [J]. Nakashima Hiroshi. Computers and Electrical Engineering, 2015.
4. Initial results on computational performance of Intel Many Integrated Core (MIC) architecture: implementation of the Weather and Research Forecasting (WRF) Purdue-Lin microphysics scheme [C]. Jarno Mielikainen, Bormin Huang, Allen H.-L. Huang. Conference on high-performance computing in remote sensing, 2014.
5. A computational model for developmental biology with parallel implementation on graphical processing unit [D]. Sun, Wenzhao. 2015.
6. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards [O]. Francesc Massanes, Marie Cadennes, Jovan G. Brankov.
7. On architecture and performance of adaptive mesh refinement in an electrostatics Particle-In-Cell code [O]. Matthias Frey, Andreas Adelmann, Uldis Locans. 2020.
8. Computational Performance of Intel MIC, Sandy Bridge, and GPU Architectures: Implementation of a 1D c++/OpenMP Electrostatic Particle-In-Cell Code [R]. Lapenta, G, Vapirev, A, Deca, J. 2014.
