IEEE International Symposium on Parallel Distributed Processing (IPDPS 2009)

A framework for efficient and scalable execution of domain-specific templates on GPUs

Abstract

Graphics processing units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain-specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts: processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of offloaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, using lower-level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7-7.8X performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6 GB and 17 GB, respectively, on GPU platforms with only 768 MB and 1.5 GB of memory.
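
The abstract does not include implementation details, but the kind of chunked, overlapped host-GPU execution it describes can be illustrated with a small CUDA sketch. The example below is a minimal, hypothetical illustration, not the paper's generated code: it splits an input array that exceeds a chosen device-buffer size into chunks, and uses two streams with pinned host memory so that transfers and kernel execution on successive chunks can overlap. The kernel name edge_filter, the chunk size, and the data sizes are illustrative assumptions.

// Minimal CUDA sketch (hypothetical, not from the paper): process an input
// array that exceeds the device buffer size by splitting it into chunks and
// overlapping host-to-device copies, kernel execution, and device-to-host
// copies across two streams. Names and sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void edge_filter(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // stand-in for a real edge-detection stencil
}

int main() {
    const size_t total = 1ull << 26;  // 64M elements (256 MB): the full data set
    const size_t chunk = 1ull << 22;  // 4M elements (16 MB): fits on the device

    // Pinned host memory allows asynchronous copies to overlap with kernels.
    float *h_in, *h_out;
    cudaMallocHost((void **)&h_in,  total * sizeof(float));
    cudaMallocHost((void **)&h_out, total * sizeof(float));

    // Double-buffered device storage, one buffer per stream.
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void **)&d_in[b],  chunk * sizeof(float));
        cudaMalloc((void **)&d_out[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    // Walk over the data set chunk by chunk, alternating between the buffers.
    int b = 0;
    for (size_t off = 0; off < total; off += chunk, b ^= 1) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(d_in[b], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        edge_filter<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out + off, d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    printf("processed %zu elements in chunks of %zu\n", total, chunk);

    for (int b2 = 0; b2 < 2; ++b2) {
        cudaFree(d_in[b2]); cudaFree(d_out[b2]); cudaStreamDestroy(s[b2]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

Work issued to different streams on different buffers can overlap, which is one simple way to mask host-GPU transfer cost for data sets larger than device memory; the framework described in the abstract derives such split-and-schedule plans automatically from the operator graph rather than requiring them to be hand-coded.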
