A Coarse-Grained Semi-Automatic Parallelization Approach for Loop Optimization and Irregular Code Sections

刘松; 赵博; 蒋庆; 伍卫国

Abstract

Although multi-core processors have become the mainstream processor architecture, many serial programs and legacy software packages still cannot exploit their parallel computing power because they have never been efficiently parallelized. Manually re-engineering and refactoring this legacy software is time-consuming and costly, so automatic parallelization techniques have become a focus of attention in academia and industry. This article proposes a novel semi-automatic parallelization approach that targets optimization of regular for-loops and coarse-grained parallelism for irregular code sections in general programs. The approach employs a dynamic program analyzer to obtain the control and data dependences of a program. The gathered dependence information is used to build Computational Unit (CU) graphs, from which task graphs are then derived, and from these the coarse-grained task parallelism of code sections can be extracted. For for-loop code, a series of optimizations is applied as code transformations. A profitable tiling model is proposed to select the loops that are worth optimizing further. The model is based on a large body of statistical data from locality analysis of loop iterations, and it determines whether a loop should be tiled by invoking a loop transformation optimizer. Tile size selection (TSS) has an important impact on the performance of tiled code, so a tile size selection algorithm based on uniform mapping in cache (UMC-TSS) is proposed to generate optimized tiled code and achieve coarse-grained parallelism across tiles. UMC-TSS improves on a state-of-the-art TSS method to obtain better cache utilization and loop parallelism.

A source-to-source transformation framework based on Clang, the LLVM frontend, is developed to transform sequential C/C++ code into Intel TBB parallel code. The framework integrates dynamic program analysis, coarse-grained parallelism extraction, loop optimizations (including the proposed profitable tiling model and UMC-TSS), and code transformations. It performs high-level code restructuring on the program's abstract syntax tree (AST). Guided by the task graphs, it wraps for-loops in the Intel TBB parallel_for template and irregular code sections in the Intel TBB flow graph templates. The transformation is semi-automatic in that only a small amount of manual effort and intervention is required.

A series of experiments was conducted on an Intel Xeon multi-core server to evaluate the transformed parallel code on 18 representative benchmarks selected from 4 different benchmark suites. The results show that the parallel code generated by the semi-automatic approach achieves good parallelism compared with parallel code written by experts, especially for code with optimized for-loops. The average speedups of for-loop parallelization and task parallelization are 10.95 and 4.45, respectively. The evaluation also validates the correctness of the profitable tiling model. Compared with a state-of-the-art tile size selection algorithm, UMC-TSS improves the performance of tiled loop code by 4% on average. The results further show that the generated Intel TBB parallel code scales well as the number of threads varies, which demonstrates the effectiveness of the parallelization approach and of the source-to-source transformation framework presented in this paper.
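To make the loop tiling idea in the abstract concrete, the following is a minimal sketch of a tiled matrix multiplication in plain C++. It is not the authors' tool output and does not implement UMC-TSS or the profitable tiling model: the function names `matmul_naive` and `matmul_tiled`, the row-major `Matrix` alias, and the `tile` parameter are hypothetical, and the tile size is simply passed in by the caller rather than selected from cache parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: an n x n matrix stored in row-major order.
using Matrix = std::vector<double>;

// Untiled reference loop nest: C = A * B.
Matrix matmul_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Tiled loop nest: the iteration space is split into tile x tile x tile
// blocks so that each block reuses a small working set of A, B, and C.
// The tile size is an assumption supplied by the caller; choosing it well
// is the problem that a tile size selection algorithm addresses.
Matrix matmul_tiled(const Matrix& a, const Matrix& b, std::size_t n,
                    std::size_t tile) {
    assert(tile > 0);
    Matrix c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Intra-tile loops keep the same i-k-j order as the
                // reference, so each C element accumulates in the same order.
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k)
                        for (std::size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    return c;
}
```

In this sketch, different iterations of the outermost `ii` loop write disjoint rows of `c`, so they are independent of each other. That is the kind of coarse-grained, tile-level parallelism that could be handed to a construct such as Intel TBB's `parallel_for`; how the paper's framework actually emits that code is not shown here.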
