Modern multi-core architectures are characterized by Non-Uniform Memory Access (NUMA). Exploiting such architectures efficiently is difficult for programmers: on these systems, multi-threaded programs can suffer high memory access latency if the mapping of data and computation to nodes is not considered carefully. Programmers therefore need tools that detect the performance problems behind high memory access latency. To address this need, we present LaProf, a profiling tool that uses memory access latency information to detect performance problems on NUMA systems. LaProf detects three performance problems in multi-threaded programs: 1) data sharing, where shared data causes remote memory accesses if the threads that access it are not placed on the same NUMA node; 2) shared resource contention, where contention on shared resources such as last-level caches, interconnect links, and memory controllers raises memory access latency and severely degrades performance; and 3) remote access imbalance, where the thread with the largest number of remote data accesses becomes the critical thread that slows down the whole multi-threaded program. After detection with LaProf, applying simple and general NUMA optimization techniques yields performance improvements of 88%, 32%, and 99% for the three problems, respectively.