ACM Transactions on Computer Systems

A Study of Source-Level Compiler Algorithms for Automatic Construction of Pre-Execution Code



Abstract

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts). We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%. For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.
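As a rough illustration of the first two algorithms, the sketch below shows, in plain C, the kind of helper-thread code such a compiler aims to construct. The function names, the indirect-array access pattern, and the use of the GCC/Clang __builtin_prefetch intrinsic are assumptions made here for illustration only; they are not code generated by the paper's compiler.

/* ---- illustrative sketch, not taken from the paper ---- */
#include <stddef.h>

/* Main computation: the irregular read data[index[i]] misses the cache
   frequently and stalls the main thread on a long-latency load. */
double main_loop(const int *index, const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[index[i]];          /* cache-missing memory reference */
    return sum;
}

/* Pre-execution slice for the same loop, intended to run in a spare SMT
   hardware context ahead of main_loop.  Program slicing keeps only the
   address computation (index[i]); the accumulation into sum is
   non-critical and is dropped.  Prefetch conversion replaces the
   blocking reference data[index[i]] with a non-blocking prefetch so the
   helper thread does not stall on it; __builtin_prefetch stands in for
   whatever prefetch instruction the target machine provides. */
void preexec_slice(const int *index, const double *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&data[index[i]], /* read */ 0, /* low locality */ 1);
}

For loops whose address computation itself blocks (for example, pointer chasing through a linked list), prefetch conversion alone does not help; that is the case the paper's third algorithm, speculative loop parallelization, targets by spreading iterations across several helper threads. Thread spawning, synchronization with the main computation, and throttling are omitted from this sketch.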


