International Symposium on Parallel Distributed Processing

A framework for efficient and scalable execution of domain-specific templates on GPUs



Abstract

Graphics Processing Units (GPUs) have emerged as important players in the computing industry's transition from sequential to multi- and many-core computing. We propose a software framework for executing domain-specific parallel templates on GPUs that simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts: processing large data sets that do not fit within GPU memory, and minimizing data transfers between the host and the GPU. Our framework takes domain-specific parallel programming templates expressed as parallel operator graphs and performs operator splitting, offload unit identification, and scheduling of offloaded computations and host-GPU data transfers to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program, in accordance with the derived execution plan, that uses lower-level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks commonly used in image and video analysis. We present results on two different NVIDIA GPU platforms (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7-7.8x performance improvements over already-accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6 GB and 17 GB, respectively, on GPU platforms with only 768 MB and 1.5 GB of memory.
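The abstract's central technique, splitting an operator so its data can be streamed through a GPU whose memory is smaller than the working set, can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's actual API): a Sobel-style edge-detection convolution is split into row-wise chunks with one-row halos, so that each "transfer" fits an assumed device-memory budget. NumPy stands in for the GPU, and all names and the `max_rows_on_device` budget are illustrative.

```python
import numpy as np

# Sobel-X kernel, a common edge-detection operator.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

def convolve2d_valid(tile, kernel):
    """Plain 'valid' 2-D correlation applied to one chunk."""
    kh, kw = kernel.shape
    out_h = tile.shape[0] - kh + 1
    out_w = tile.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(tile[i:i + kh, j:j + kw] * kernel)
    return out

def split_execute(image, kernel, max_rows_on_device):
    """Process an image larger than 'device memory' chunk by chunk.

    Chunks overlap by the kernel radius (the halo) so that each output
    row is computed exactly once from complete input neighborhoods.
    """
    halo = kernel.shape[0] // 2          # rows of overlap on each side
    h = image.shape[0]
    outputs = []
    row = halo
    while row < h - halo:
        stop = min(row + max_rows_on_device, h - halo)
        # 'Transfer' only this chunk plus its halos to the device.
        tile = image[row - halo:stop + halo, :]
        outputs.append(convolve2d_valid(tile, kernel))
        row = stop
    return np.vstack(outputs)

# Reference: single-shot execution, as if memory were unlimited.
img = np.arange(20 * 8, dtype=np.float32).reshape(20, 8)
full = convolve2d_valid(img, SOBEL_X)
chunked = split_execute(img, SOBEL_X, max_rows_on_device=4)
assert np.allclose(full, chunked)
```

Because the chunks overlap by exactly the kernel radius, the chunked result is bit-identical to the monolithic one; the same halo reasoning underlies operator splitting for any stencil-shaped operator in a parallel operator graph.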
