We evaluate optimized parallel sparse matrix-vector operations for several representative application areas on widespread multicore-based cluster configurations. First, the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Beyond the single node, the performance of parallel sparse matrix-vector operations is often limited by communication overhead. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. Moreover, we identify performance benefits of hybrid MPI/OpenMP programming due to improved load balancing, even without explicit communication overlap. We compare performance results for pure MPI, the widely used "vector-like" hybrid programming strategies, and explicit overlap on a modern multicore-based cluster and a Cray XE6 system.