Because parallel computers differ in their ratio of computation to communication performance, the computational block size that maximizes the performance of an algorithm differs from one machine to another. A block size that is too small or too large makes good performance on a machine nearly impossible to obtain. In such a case, improving performance may require a complete redistribution of the data matrix. We present PoLAPACK factorization routines, including LU, QR, and Cholesky factorizations, that use "algorithmic blocking" on a 2-dimensional block cyclic data distribution. With algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines.
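As an illustration of the 2-dimensional block cyclic data distribution mentioned above, the following minimal sketch maps a global matrix entry to its owning process and local index. It assumes zero-based indices, a P x Q process grid, and a distribution that starts at process (0, 0). The function names `owner` and `local_index` are hypothetical and are not taken from PoLAPACK or ScaLAPACK. The sketch shows only how the physical block size fixes the data layout; it does not attempt to reproduce the algorithmic blocking itself.

```python
def owner(i, j, mb, nb, P, Q):
    """Return grid coordinates (p, q) of the process owning global entry (i, j).

    mb x nb is the physical block size; P x Q is the process grid.
    Assumes the first block is stored on process (0, 0).
    """
    return (i // mb) % P, (j // nb) % Q


def local_index(g, b, nprocs):
    """Return the local index of global index g along one dimension.

    b is the block size along that dimension and nprocs the number of
    processes along it. Full cycles of nprocs blocks each contribute one
    local block of b entries, plus the offset inside the current block.
    """
    return (g // (b * nprocs)) * b + g % b


if __name__ == "__main__":
    mb, nb, P, Q = 2, 2, 2, 3
    # Print the owner of every entry of a small 4 x 6 matrix.
    for i in range(4):
        print([owner(i, j, mb, nb, P, Q) for j in range(6)])
```

Because `mb` and `nb` determine which process holds each entry, changing the physical block size means moving data between processes, which is the redistribution cost the abstract refers to.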