Although NUMA-aware optimizations are often considered non-portable, this paper shows that extending a compiler that supports the compilation of parallel APIs with NUMA-aware optimizations significantly improves performance and energy consumption on NUMA systems, while on UMA systems the optimizations do not degrade performance unless the overhead of calling the mapping functions is significantly larger than the improvement the optimizations produce. The paper introduces the NUMA-BTLP algorithm, a compile-time optimization for the LLVM compiler, which determines the type of each thread in the program code through a static analysis of that code. NUMA-BTLP then calls the NUMA-BTDM algorithm, which uses specific PThreads routines to set the CPU affinities of the threads (i.e., the thread-to-core associations) depending on the thread type returned by NUMA-BTLP. The two algorithms improve thread and data mapping on NUMA systems by executing threads that share data on the same core(s), allowing fast access to shared data in the L1 cache. The paper shows that task-based parallel code that uses PThreads, and that may contain shared-memory parallel loops (LLVM supports both task and loop parallelism, through the PThreads library and the OpenMP extension, respectively), is time- and energy-efficient at runtime when optimized with the two algorithms. However, the algorithms are expected to produce runtime energy improvements only on NUMA systems whose energy model assumes constant energy consumption, or in which each core is powered from a separate source.