This paper addresses the problem of parallelizing imperfectly nested loops. Current parallelizing compilers and loop transformations either parallelize only the innermost loop (which resembles vectorization more than parallelization) or do not parallelize the loops at all. We present an approach that transforms an imperfectly nested loop into at most three fully parallel, perfectly nested loops. The transformed loops can then be parallelized by any parallelizing compiler. The advantages of our technique are the simplicity of the transformed loops and the low synchronization overhead. We tested the feasibility of this approach on several types of loops, including loops from the Eispack math library and from the Linpack benchmark, on different multiprocessor platforms, and compared its performance with Sun's MP C and Cray's autotasking. The results show that our method is very effective.