Conference: ACM International Conference on Supercomputing
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations

Abstract

A trend that has materialized and attracted much attention is the growing heterogeneity of computing platforms. It is now very common for a desktop or notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the maximum computational power of such architectures (i.e., simultaneously exploiting both the multi-core CPU and the GPU) starting from a high-level API is a critical challenge. We believe that it would be highly desirable to give programmers a simple way to realize the full potential of today's heterogeneous machines. This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting with simple C functions with added annotations, we automatically generate the middleware API code for the multi-core CPU, as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, k-means clustering and Principal Component Analysis (PCA), show that, by effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance than with only the GPU or only the multi-core CPU. In k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to the 1-thread CPU version. Compared to the faster of the CPU-only and GPU-only executions, we achieved a performance gain of about 60%. In PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version. Compared to the faster of the CPU-only and GPU-only versions, we achieved a performance gain of about 63.8%.
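The generalized-reduction pattern and the dynamic work partitioning described in the abstract can be illustrated with a small sketch. This is not the paper's generated middleware or CUDA code: it is a minimal Python illustration that assumes one k-means assignment step as the reduction, and uses plain threads as stand-ins for the CPU cores and the GPU. All names here (`kmeans_reduce_chunk`, `combine`, `heterogeneous_reduce`, the worker labels) are hypothetical. Each worker pulls chunks from a shared queue, so a faster device would naturally take more chunks, and each worker updates a private reduction object that is merged at the end.

```python
import queue
import threading


def kmeans_reduce_chunk(points, centers):
    """Local reduction over one chunk: per-cluster coordinate sums and counts."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Assign the point to its nearest center (squared Euclidean distance).
        best = min(
            range(k),
            key=lambda c: sum((p[d] - centers[c][d]) ** 2 for d in range(dim)),
        )
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    return sums, counts


def combine(a, b):
    """Merge two reduction objects; element-wise addition is order-independent."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def heterogeneous_reduce(points, centers, chunk_size=4,
                         workers=("cpu0", "cpu1", "gpu")):
    """Dynamically distribute chunks to workers, then merge their partial results.

    The worker labels are only placeholders: every worker runs the same Python
    function here, whereas a real system would dispatch GPU chunks to a kernel.
    """
    chunks = queue.Queue()
    for i in range(0, len(points), chunk_size):
        chunks.put(points[i:i + chunk_size])

    k, dim = len(centers), len(centers[0])
    partials = {}

    def worker(name):
        # Private reduction object, so no locking is needed during processing.
        local = ([[0.0] * dim for _ in range(k)], [0] * k)
        while True:
            try:
                chunk = chunks.get_nowait()  # grab the next unprocessed chunk
            except queue.Empty:
                break
            local = combine(local, kmeans_reduce_chunk(chunk, centers))
        partials[name] = local

    threads = [threading.Thread(target=worker, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Global combination of the per-worker reduction objects.
    result = ([[0.0] * dim for _ in range(k)], [0] * k)
    for name in workers:
        result = combine(result, partials[name])
    return result
```

Because the merge is element-wise addition, the result does not depend on which worker processed which chunk, apart from floating-point rounding when the sums are not exactly representable. That independence is what lets a runtime assign chunks dynamically instead of fixing a CPU/GPU split in advance.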