Both inherently sequential code and limitations of analysis techniques prevent full parallelization of many applications by parallelizing compilers. Amdahl's Law tells us that as parallelization becomes increasingly effective, any unparallelized loop becomes an increasingly dominant performance bottleneck. We present a technique for speeding up the execution of unparallelized loops by cascading their sequential execution across multiple processors: only a single processor executes the loop body at any one time, and each processor executes only a portion of the loop body before passing control to another. Cascaded execution allows otherwise idle processors to optimize their memory state for the eventual execution of their next portion of the loop, resulting in significantly reduced overall loop body execution times. We evaluate cascaded execution using loop nests from wave5, a Spec95fp benchmark application, and a synthetic benchmark. Running on a PC with 4 Pentium Pro processors and an SGI Power Onyx with 8 R10000 processors, we observe overall speedups of 1.35 and 1.7, respectively, for the wave5 loops we examined and speedups as high as 4.5 for individual loops. Our extrapolated results using the synthetic benchmark show a potential for speedups as large as 16 on future machines.
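The Amdahl's Law argument above can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: the function names and the 90% parallel fraction are illustrative assumptions, and the processor counts 4 and 8 are chosen only to echo the two machines mentioned. It shows how the share of run time spent in unparallelized code grows as the parallel part gets faster.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Overall speedup under Amdahl's Law for a given parallelizable fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def serial_share(parallel_fraction: float, processors: int) -> float:
    """Fraction of the parallelized run time still spent in sequential code."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction / (serial_fraction + parallel_fraction / processors)


# Hypothetical workload: 90% of the original run time is parallelizable.
PARALLEL_FRACTION = 0.9

for n in (1, 4, 8):
    speedup = amdahl_speedup(PARALLEL_FRACTION, n)
    share = serial_share(PARALLEL_FRACTION, n)
    print(f"{n} processors: speedup {speedup:.2f}, sequential share {share:.0%}")
```

Under these assumed numbers the sequential share rises with the processor count, which is why shortening the execution time of the unparallelized loops themselves becomes increasingly worthwhile.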