Journal of Supercomputing

Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors

Abstract

General matrix-matrix multiplication (GEMM) is a core building block for implementing the Basic Linear Algebra Subprograms (BLAS). This paper presents a methodology for automatically producing matrix-matrix multiplication kernels, written with AVX-512 intrinsic functions, that are tuned for the Intel Xeon Phi processor code-named Knights Landing (KNL) and for Intel Skylake-SP processors. The latest manycore processors have become complex in both their levels of parallelism and their cache hierarchies, so it is not easy to find the best combination of optimization techniques for a given application. Our approach produces matrix multiplication kernels through heuristic auto-tuning: it generates multiple candidate kernels and selects the fastest ones through performance tests. The tuning parameters include the sizes of the block matrices for registers and caches, the prefetch distances, and the loop unrolling depth. We also investigate parameters for multithreaded execution, such as which loops to parallelize and the optimal number of threads for each such loop. Finally, we present a method for reducing the parameter search space based on our previous research results.
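The generate-and-measure loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' generator: it uses pure Python instead of AVX-512 intrinsics, tunes only three cache-block sizes, and prunes the search space with a made-up working-set budget. The names `blocked_gemm`, `search_space`, `autotune`, and the `budget` parameter are all hypothetical and chosen for illustration only.

```python
import itertools
import time


def blocked_gemm(A, B, n, mb, nb, kb):
    """Compute C = A * B for n x n list-of-lists matrices, tiled by (mb, nb, kb)."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, mb):
        for k0 in range(0, n, kb):
            for j0 in range(0, n, nb):
                # Work on one block; min() handles partial blocks at the edges.
                for i in range(i0, min(i0 + mb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(k0, min(k0 + kb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(j0, min(j0 + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def search_space(sizes, budget):
    """Enumerate (mb, nb, kb) candidates, pruned by a hypothetical budget.

    Assumption: a candidate is kept only if its three blocks together hold
    at most `budget` elements. This stands in for a real search-space
    reduction and is not the rule used in the paper.
    """
    return [
        (mb, nb, kb)
        for mb, nb, kb in itertools.product(sizes, repeat=3)
        if mb * kb + kb * nb + mb * nb <= budget
    ]


def autotune(n, candidates, repeats=2):
    """Time every candidate and return (seconds, params) pairs, fastest first."""
    # Small integer-valued inputs keep the floating-point sums exact,
    # so results can be compared with == against the reference.
    A = [[float((i * n + j) % 7) for j in range(n)] for i in range(n)]
    B = [[float((i + 2 * j) % 5) for j in range(n)] for i in range(n)]
    expected = [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    results = []
    for params in candidates:
        # Discard any variant that does not reproduce the reference result.
        if blocked_gemm(A, B, n, *params) != expected:
            continue
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(A, B, n, *params)
            best = min(best, time.perf_counter() - start)
        results.append((best, params))
    results.sort()
    return results


if __name__ == "__main__":
    ranked = autotune(24, search_space([4, 8, 16], budget=400))
    seconds, (mb, nb, kb) = ranked[0]
    print(f"fastest blocks: mb={mb} nb={nb} kb={kb} ({seconds:.6f} s)")
```

A real tuner in the spirit of the abstract would emit and compile C kernels for each parameter set and would extend the candidate tuple with prefetch distances, unrolling depth, and per-loop thread counts. The control flow would stay the same: enumerate, prune, verify, time, and rank.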
