Conference: International Symposium on Microarchitecture

Improving the Effectiveness of Searching for Isomorphic Chains in Superword Level Parallelism

Abstract

Most high-performance microprocessors come equipped with general-purpose Single Instruction Multiple Data (SIMD) execution engines to enhance performance. Compilers use auto-vectorization techniques to identify vector parallelism and generate SIMD code so that applications can enjoy the performance benefits provided by SIMD units. Superword Level Parallelism (SLP), one such vectorization technique, forms vector operations by merging isomorphic instructions into a vector operation and linking many such operations into long isomorphic chains. However, effective grouping of isomorphic instructions remains a key challenge for SLP algorithms. In this work, we describe a new hierarchical approach for SLP. We decouple the selection of isomorphic chains and arrange them in a hierarchy of choices at the local and global levels. First, we form small local chains from a set of preferred patterns and rank them. Next, we form long global chains from the local chains using a few simple heuristics. Hierarchy allows us to balance the grouping choices of individual instructions more effectively within the context of larger local and global chains, thereby finding better opportunities for vectorization. We implement our algorithm in LLVM, and we compare it against prior work and the current SLP implementation in LLVM. To compare our approach with prior techniques, we use a set of applications that benefit from vectorization, taken from the NAS Parallel Benchmarks and the SPEC CPU 2006 suite. We demonstrate that our new algorithm finds better isomorphic chains. Our new approach achieves an 8.6% speedup, on average, compared to non-vectorized code and a 2.5% speedup, on average, over LLVM-SLP. In the best case, the BT application has 11% fewer total dynamic instructions and achieves a 10.9% speedup over LLVM-SLP.
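
To make the notion of an isomorphic chain concrete, the following is a minimal Python sketch of generic SLP-style chain building: it starts from a seed pack of adjacent stores and follows operands upward, keeping each group of instructions that share an opcode (and, for memory operations, touch consecutive addresses). It is not the hierarchical algorithm of the paper, whose preferred patterns, ranking, and heuristics are not given in the abstract. The names `Instr`, `isomorphic`, and `build_chain`, and the toy program, are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Toy scalar instruction (hypothetical representation, not LLVM IR)."""
    opcode: str                    # e.g. "load", "add", "store"
    operands: tuple = ()           # names of the instructions this one consumes
    addr: Optional[tuple] = None   # (array, index) for memory operations


def isomorphic(group):
    """Same opcode and operand count; memory ops must hit consecutive addresses."""
    first = group[0]
    if any(g.opcode != first.opcode or len(g.operands) != len(first.operands)
           for g in group):
        return False
    if first.addr is not None:
        array, base = first.addr
        return all(g.addr == (array, base + i) for i, g in enumerate(group))
    return True


def build_chain(program, seed):
    """Follow use-def edges upward from a seed pack, collecting isomorphic packs."""
    chain, worklist = [], [tuple(seed)]
    while worklist:
        names = worklist.pop()
        group = [program[n] for n in names]
        if names in chain or not isomorphic(group):
            continue  # a real vectorizer would gather such operands as scalars
        chain.append(names)
        # Pack the k-th operand of every lane together and try it next.
        for k in range(len(group[0].operands)):
            worklist.append(tuple(g.operands[k] for g in group))
    return chain


# Toy straight-line program: c[i] = a[i] + b[i] for i = 0..3, fully unrolled.
program = {}
for i in range(4):
    program[f"la{i}"] = Instr("load", addr=("a", i))
    program[f"lb{i}"] = Instr("load", addr=("b", i))
    program[f"add{i}"] = Instr("add", operands=(f"la{i}", f"lb{i}"))
    program[f"st{i}"] = Instr("store", operands=(f"add{i}",), addr=("c", i))

chain = build_chain(program, [f"st{i}" for i in range(4)])
for pack in chain:
    print(program[pack[0]].opcode, pack)
```

In this toy case the chain consists of four packs (the stores, the adds, and the two groups of loads), each of which could map to one four-wide vector operation. The sketch accepts every isomorphic pack it encounters and never weighs competing groupings against each other, which is the selection problem the abstract identifies as the key challenge.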