Journal: Future Generation Computer Systems

A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit


Abstract

In recent years, the heterogeneity of both commodity and supercomputer hardware has increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often key to improving the speed and energy efficiency of highly parallel codes. However, due to the complexity of heterogeneous architectures, optimizing codes for a certain type of architecture, as well as porting codes across different architectures while maintaining a comparable level of performance, can be extremely challenging. Autotuning has gained considerable interest as a way to address the challenges of performance optimization and performance portability. Autotuning of performance-relevant source-code parameters allows applications to be tuned automatically without hard-coding optimizations, and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data changes. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead. Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain.
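The idea of dynamic tuning described in the abstract can be illustrated with a minimal sketch. This is not the Kernel Tuning Toolkit API: `tuning_space`, `DynamicTuner`, and `blocked_sum` are hypothetical names invented for illustration, and a plain Python function stands in for a GPU kernel. The sketch assumes a tuning space small enough to be searched exhaustively, with one untested configuration tried per call, so that every call still returns a usable result while tuning is in progress.

```python
import itertools
import time


def tuning_space(parameters, constraint=lambda cfg: True):
    """Enumerate every configuration (as a dict) that satisfies the constraint."""
    names = list(parameters)
    for values in itertools.product(*(parameters[n] for n in names)):
        cfg = dict(zip(names, values))
        if constraint(cfg):
            yield cfg


class DynamicTuner:
    """Hypothetical helper that tunes while the application does useful work."""

    def __init__(self, kernel, parameters, constraint=lambda cfg: True):
        self.kernel = kernel
        self.configs = list(tuning_space(parameters, constraint))
        self.reset()

    def reset(self):
        """Forget all measurements, e.g. when the input data changes."""
        self.pending = list(self.configs)
        self.best_cfg = None
        self.best_time = float("inf")

    def run(self, data):
        # While untested configurations remain, each call tests one of them;
        # afterwards every call uses the fastest configuration seen so far.
        cfg = self.pending.pop() if self.pending else self.best_cfg
        start = time.perf_counter()
        result = self.kernel(data, **cfg)
        elapsed = time.perf_counter() - start
        if elapsed < self.best_time:
            self.best_cfg, self.best_time = cfg, elapsed
        return result


def blocked_sum(data, block, unroll):
    """Stand-in for a tunable kernel: sums `data` in chunks of block * unroll."""
    step = block * unroll
    return sum(sum(data[i:i + step]) for i in range(0, len(data), step))


# Assumed tuning parameters and constraint, chosen only for illustration.
tuner = DynamicTuner(
    blocked_sum,
    {"block": [16, 64, 256], "unroll": [1, 2, 4]},
    constraint=lambda c: c["block"] * c["unroll"] <= 512,
)
data = list(range(10_000))
for _ in range(12):
    total = tuner.run(data)  # every call returns a correct result
```

The constraint prunes the space from nine configurations to eight, which hints at why a rationally designed tuning space can stay small enough to search at runtime. Calling `tuner.reset()` when the input changes restarts the search, mirroring the re-tuning scenario the abstract mentions.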
