IEEE International Symposium on Parallel Distributed Processing (IPDPS 2009)

A framework for efficient and scalable execution of domain-specific templates on GPUs

Abstract

Graphics processing units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain-specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts: processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of offloaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, using lower-level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7-7.8X performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6 GB and 17 GB, respectively, on GPU platforms with only 768 MB and 1.5 GB of memory.
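
The abstract does not include implementation details, but the kind of chunked, overlapped host-GPU execution it describes can be illustrated with a small CUDA sketch. The example below is a minimal, hypothetical illustration, not the paper's generated code: it splits an input array that exceeds a chosen device-buffer size into chunks, and uses two streams with pinned host memory so that transfers and kernel execution on successive chunks can overlap. The kernel name edge_filter, the chunk size, and the data sizes are illustrative assumptions.

// Minimal CUDA sketch (hypothetical, not from the paper): process an input
// array that exceeds the device buffer size by splitting it into chunks and
// overlapping host-to-device copies, kernel execution, and device-to-host
// copies across two streams. Names and sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void edge_filter(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // stand-in for a real edge-detection stencil
}

int main() {
    const size_t total = 1ull << 26;  // 64M elements (256 MB): the full data set
    const size_t chunk = 1ull << 22;  // 4M elements (16 MB): fits on the device

    // Pinned host memory allows asynchronous copies to overlap with kernels.
    float *h_in, *h_out;
    cudaMallocHost((void **)&h_in,  total * sizeof(float));
    cudaMallocHost((void **)&h_out, total * sizeof(float));

    // Double-buffered device storage, one buffer per stream.
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void **)&d_in[b],  chunk * sizeof(float));
        cudaMalloc((void **)&d_out[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    // Walk over the data set chunk by chunk, alternating between the buffers.
    int b = 0;
    for (size_t off = 0; off < total; off += chunk, b ^= 1) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(d_in[b], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        edge_filter<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out + off, d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    printf("processed %zu elements in chunks of %zu\n", total, chunk);

    for (int b2 = 0; b2 < 2; ++b2) {
        cudaFree(d_in[b2]); cudaFree(d_out[b2]); cudaStreamDestroy(s[b2]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

Work issued to different streams on different buffers can overlap, which is one simple way to mask host-GPU transfer cost for data sets larger than device memory; the framework described in the abstract derives such split-and-schedule plans automatically from the operator graph rather than requiring them to be hand-coded.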
