Emerging Symmetric Multi-Processing (SMP) architectures have shifted from the shared-bus topology to Cache-coherent Non-Uniform Memory Access (ccNUMA), where Processing Elements (PEs) access the distributed memories with different delays. This shift potentially impacts the performance of current SMP software packages and the way they address the complexity of the new architecture. In this work, we compare our user-level thread scheduling mechanism [16] with the OpenMP scheduler when multiplying two large matrices on a dual-socket NUMA architecture. We analyzed and evaluated an optimized, multi-threaded implementation of the Level-3 BLAS general matrix multiplication routine (DGEMM). We show that making such a memory-intensive operation architecture-aware minimizes memory bottlenecks, improves the utilization of the memory caches, and consequently raises overall performance. Specifically, we present a thread-scheduling and data-alignment strategy that reduces the number of cache misses to one-third of those incurred by the non-tuned implementation and cuts the required computation time by up to 22%. Finally, we show the relationship between the number of cache misses and the speedup percentage gained by our implementation, which supports our hypothesis about the data-locality problem and memory bottleneck in a non-NUMA-aware implementation.