Parallelizing numerical algorithms is very important in scientific applications, but many aspects of this parallelization remain open. In particular, the overhead of loading and unloading data degrades efficiency and, in a realistic approach, should be taken into account when estimating performance. The authors present a way of overcoming the data loading and unloading bottleneck by overlapping computation and communication in a specific algorithm, matrix-vector multiplication. A mapping of this algorithm onto hardware is also presented to demonstrate the parallelization methodology.
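
As a rough software illustration of overlapping data loading with computation in matrix-vector multiplication, the sketch below prefetches the next block of rows on a background thread while the current block is being multiplied. It is not the authors' hardware mapping: the function names `load_block` and `matvec_overlapped`, the `block_rows` parameter, and the use of a row copy as a stand-in for a data transfer are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def load_block(matrix, start, stop):
    # Hypothetical stand-in for a data transfer: copy rows into a local buffer.
    return [row[:] for row in matrix[start:stop]]


def matvec_overlapped(matrix, x, block_rows=2):
    """Compute y = A x block by block, loading block i+1 while computing block i."""
    n = len(matrix)
    y = []
    starts = list(range(0, n, block_rows))
    if not starts:
        return y
    with ThreadPoolExecutor(max_workers=1) as loader:
        # Start loading the first block before any computation begins.
        pending = loader.submit(load_block, matrix, starts[0],
                                min(starts[0] + block_rows, n))
        for i, _ in enumerate(starts):
            block = pending.result()  # wait for the block that is needed now
            if i + 1 < len(starts):
                nxt = starts[i + 1]
                # Issue the next load so it proceeds during the computation below.
                pending = loader.submit(load_block, matrix, nxt,
                                        min(nxt + block_rows, n))
            for row in block:
                y.append(sum(a * b for a, b in zip(row, x)))
    return y


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    print(matvec_overlapped(A, [1, 1]))
```

With this two-buffer arrangement, the load time is hidden only to the extent that computing one block takes at least as long as loading the next; otherwise the loop still stalls at `pending.result()`.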