International Journal of Parallel Programming

Automated Compiler Optimization of Multiple Vector Loads/Stores



Abstract

With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support from the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper we focus on these vectorized codes with irregular accesses, and show that while the best-in-class Intel Compiler Vectorizer does indeed provide speedup through efficient vectorization, there are opportunities where clever program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations that target these data access patterns. The first optimization improves the performance of a group of adjacent gathers/scatters. The second optimization improves the performance of a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are now implemented in the 17.0 version of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks and application kernels. On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).
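
To make the two access patterns concrete, the sketch below shows the kinds of scalar loops they typically arise from: an indexed array-of-structures loop whose field loads vectorize into a group of adjacent gathers, and a three-point stencil whose vector loads overlap across iterations. This is a minimal illustration; the array names, sizes, and index function are assumptions for demonstration, not taken from the paper.

```c
/* Illustrative sketch (not from the paper) of the two access patterns
 * targeted by the optimizations: adjacent gathers and stencil loads.
 * Names, sizes, and the index stream are hypothetical. */
#include <stdio.h>

#define N 1024

typedef struct { double x, y, z; } point_t;   /* AoS layout */

static point_t pts[N];
static int     idx[N];
static double  a[N + 2], b[N], out[N];

int main(void) {
    for (int i = 0; i < N; i++) {
        idx[i] = (i * 7) % N;                 /* irregular index stream */
        pts[i].x = i; pts[i].y = 2.0 * i; pts[i].z = 3.0 * i;
        a[i] = (double)i;
    }
    a[N] = a[N + 1] = 0.0;

    /* Pattern 1: a group of adjacent gathers. When vectorized, the three
     * field loads become gathers whose addresses differ only by a small
     * constant offset; the first optimization targets such groups. */
    for (int i = 0; i < N; i++) {
        int j = idx[i];
        b[i] = pts[j].x + pts[j].y + pts[j].z;
    }

    /* Pattern 2: stencil vector accesses. The vector loads of a[i],
     * a[i+1] and a[i+2] overlap heavily across iterations; the second
     * optimization replaces the redundant loads with cheaper SIMD
     * instructions. */
    for (int i = 0; i < N; i++) {
        out[i] = a[i] + a[i + 1] + a[i + 2];
    }

    printf("%f %f\n", b[0], out[0]);          /* keep results live */
    return 0;
}
```

The sketch only demonstrates the source-level patterns; the paper's transformations would apply when such loops are compiled with a vectorizing compiler on the hardware listed above.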
