Minimizing communications when mapping affine loop nests onto distributed-memory parallel computers has already drawn a lot of attention. We focus on the next step: since it is generally impossible to obtain a communication-free (or local) mapping, how can the residual communications be optimized? We explain how to take advantage of macro-communications such as broadcasts, scatters, gathers, and reductions, and how to decompose general affine communications into simpler ones that can be performed more efficiently. Finally, we give a two-step heuristic that summarizes our approach: first minimize the number of nonlocal communications, then optimize the residual affine communications using macro-communications or decompositions.