International Journal of Parallel Programming

Automated Compiler Optimization of Multiple Vector Loads/Stores



Abstract

With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support from the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper we focus on these vectorized codes with irregular accesses, and show that while the best-in-class Intel Compiler Vectorizer does indeed provide speedup through efficient vectorization, there are opportunities where clever program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations that target these data access patterns. The first optimization improves the performance of a group of adjacent gathers/scatters. The second optimization improves the performance of a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are now implemented in the 17.0 version of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks and application kernels. On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).
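
To make the two access patterns concrete, the sketch below shows the kinds of scalar loops they typically arise from: an indexed array-of-structures loop whose field loads vectorize into a group of adjacent gathers, and a three-point stencil whose vector loads overlap across iterations. This is a minimal illustration; the array names, sizes, and index function are assumptions for demonstration, not taken from the paper.

```c
/* Illustrative sketch (not from the paper) of the two access patterns
 * targeted by the optimizations: adjacent gathers and stencil loads.
 * Names, sizes, and the index stream are hypothetical. */
#include <stdio.h>

#define N 1024

typedef struct { double x, y, z; } point_t;   /* AoS layout */

static point_t pts[N];
static int     idx[N];
static double  a[N + 2], b[N], out[N];

int main(void) {
    for (int i = 0; i < N; i++) {
        idx[i] = (i * 7) % N;                 /* irregular index stream */
        pts[i].x = i; pts[i].y = 2.0 * i; pts[i].z = 3.0 * i;
        a[i] = (double)i;
    }
    a[N] = a[N + 1] = 0.0;

    /* Pattern 1: a group of adjacent gathers. When vectorized, the three
     * field loads become gathers whose addresses differ only by a small
     * constant offset; the first optimization targets such groups. */
    for (int i = 0; i < N; i++) {
        int j = idx[i];
        b[i] = pts[j].x + pts[j].y + pts[j].z;
    }

    /* Pattern 2: stencil vector accesses. The vector loads of a[i],
     * a[i+1] and a[i+2] overlap heavily across iterations; the second
     * optimization replaces the redundant loads with cheaper SIMD
     * instructions. */
    for (int i = 0; i < N; i++) {
        out[i] = a[i] + a[i + 1] + a[i + 2];
    }

    printf("%f %f\n", b[0], out[0]);          /* keep results live */
    return 0;
}
```

The sketch only demonstrates the source-level patterns; the paper's transformations would apply when such loops are compiled with a vectorizing compiler on the hardware listed above.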
