Efficient NAS Parallel Benchmark Kernels with CUDA

Abstract

The NAS Parallel Benchmarks (NPB) are one of the standard benchmark suites used to evaluate parallel hardware and software. Many research efforts have tried to provide parallel versions beyond the original OpenMP and MPI ones. For GPU accelerators, only OpenCL and OpenACC versions are available as consolidated implementations. Our goal is to provide an efficient parallel implementation of the five NPB kernels with CUDA. Our contribution covers several aspects. First, we followed best parallel programming practices to implement the NPB kernels using CUDA. Second, support for larger workloads (classes B and C) makes it possible to stress and investigate the memory of robust GPUs. Third, we show that it is possible to make NPB efficient and suitable for GPUs even though the benchmarks were originally designed for CPUs. In some cases we achieve double the performance of the state of the art, while also using memory efficiently. Fourth, we discuss new experiments comparing performance and memory usage against state-of-the-art OpenACC and OpenCL versions on a relatively new GPU architecture. The experimental results also show that our version is the best one for all the NPB kernels when compared with the OpenACC and OpenCL versions. The greatest differences were observed for the FT and EP kernels.

Bibliographic details

Source
Euromicro International Conference on Parallel, Distributed and Network-Based Processing | 2020 | pp. 9-16 | 8 pages
Authors
Gabriell Alves de Araujo; Dalvan Griebler; Marco Danelutto; Luiz Gustavo Fernandes
Original format: PDF
Keywords
Graphics processing units; Kernel; Benchmark testing; Programming; Computer architecture; Hardware; Standards

Similar literature

1. A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit [J]. Filip Petrovič, David Střelák, Jana Hozzová. Future Generation Computer Systems, 2020, issue Jula
2. High-performance parallel implementations of the NAS kernel benchmarks on the IBM SP2 [J]. IBM Systems Journal, 1995, issue 2
3. High-performance parallel implementations of the NAS kernel benchmarks on the IBM SP2 [J]. R. C. Agarwal, B. Alpern, L. Carter. IBM Systems Journal, 1995, issue 2
4. Efficient NAS Parallel Benchmark Kernels with CUDA [C]. Gabriell Alves de Araujo, Dalvan Griebler, Marco Danelutto. Euromicro International Conference on Parallel, Distributed and Network-Based Processing, 2020
5. Efficient GPU Parallelization of the Agent-Based Models Using MASS CUDA Library [D]. Kosiachenko, Elizaveta. 2018
6. Parallelized Seeded Region Growing Using CUDA [O]. Seongjin Park, Jeongjin Lee, Hyunna Lee. 2014
7. CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-Core Clusters [O]. Raghu Prabhakar, R. Govindarajan, Matthew J. Thazhuthaveetil. 2013
8. A Standard C Port of the NAS Kernels Benchmark Program [R]. Stockdale, I. E. 1993
