International Conference on Parallel Architectures and Compilation Techniques
Automatic OpenCL work-group size selection for multicore CPUs

Abstract

In this paper, we address the effect of the work-group size on the performance of OpenCL kernels. We propose a profiling-based algorithm that finds a work-group size that performs well on the target multicore CPU architecture. Our algorithm reduces misses in the private L1 data cache and balances the load between cores. It exploits the polyhedral model to estimate the working-set size and the number of cache misses for a parameterized work-group size of the OpenCL kernel. Based on the profiling information, it heuristically searches the space of parameterized work-group sizes. Our virtually extended index space increases the probability of finding a better work-group size. We implement our work-group size selection algorithm as a development tool that consists of a code generator and a search library. The code generator extracts the polytope of each memory reference from the kernel code and generates a function that simplifies the polytopes using run-time information and invokes search library routines. The search library calculates the working-set size from the polytopes and finds a suitable work-group size. We evaluate our approach using 31 OpenCL kernels on four different multicore CPUs, comparing its accuracy and search time to those of an exhaustive search method. Experimental results show that our tool is, on average, 1566 times faster than the exhaustive search and selects a work-group size whose performance is the same as or comparable to that of the size found by the exhaustive search.
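The two goals named in the abstract, keeping the working set within the private L1 data cache and balancing work-groups across cores, can be illustrated with a toy search. This is a minimal sketch under stated assumptions and not the paper's algorithm: it handles a one-dimensional index space only, it does not use the polyhedral model, and the names `pick_work_group_size` and `working_set_bytes` are hypothetical. The caller supplies the working-set estimate as a plain function, where the paper derives it from polytopes.

```python
from typing import Callable, List, Optional


def divisors(n: int) -> List[int]:
    """Return all positive divisors of n in ascending order."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:
                large.append(n // d)
        d += 1
    return small + large[::-1]


def pick_work_group_size(
    global_size: int,
    num_cores: int,
    l1_bytes: int,
    working_set_bytes: Callable[[int], int],  # hypothetical estimator
) -> Optional[int]:
    """Toy heuristic (an assumption, not the paper's method).

    Among work-group sizes that divide the 1-D global size and whose
    estimated working set fits in L1, prefer the one that leaves the
    fewest idle core slots in the last scheduling round, breaking ties
    in favour of the larger work-group size.
    """
    best = None
    best_key = None
    for wg in divisors(global_size):
        if working_set_bytes(wg) > l1_bytes:
            continue  # would overflow the private L1 data cache
        num_groups = global_size // wg
        rounds = -(-num_groups // num_cores)  # ceiling division
        idle_slots = rounds * num_cores - num_groups
        key = (idle_slots, -wg)
        if best_key is None or key < best_key:
            best, best_key = wg, key
    return best


if __name__ == "__main__":
    # Assumed example: a vector add touching three float arrays,
    # so each work-item contributes 3 * 4 bytes to the working set.
    size = pick_work_group_size(1024, 4, 32 * 1024, lambda wg: 12 * wg)
    print(size)
```

In this assumed example, the sizes 1024 and 512 fit in L1 but would leave cores idle on a four-core CPU, so the sketch settles on a smaller size that yields a multiple of four work-groups. An exhaustive search, by contrast, would run and time the kernel once per candidate size.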
