Computations from many scientific and engineering domains use irregular meshes and/or sparse matrices. The codes expressing these computations involve irregular reductions. The main characteristics of irregular reduction loops are 1) elements of left-hand-side arrays may be incremented in multiple iterations of the loop, but only using associative and commutative operations (these arrays are called reduction arrays), 2) there are no loop carried dependencies, except on elements of reduction arrays, and 3) one or more arrays are accessed using indirection arrays. It is very challenging to efficiently parallelize codes involving irregular reductions, especially on large parallel machines. Because of accesses through indirection arrays, communication and locality are hard to manage. Not only is the total communication volume large, but the communication requirements typically cannot be determined at compile time. It is also hard to efficiently allocate space for non-local elements. Particularly, there are no effective solutions for parallelization of adaptive irregular reductions. In an adaptive irregular reduction, the elements of the indirection arrays are modified after every few iterations. This significantly increases the overhead associated with partitioning and runtime preprocessing routines [7,11], which have been critical for achieving locality, communication efficiency, and effective buffer management. Recently, there has been much interest in multithreaded architectures. A multiprocessor based upon a multithreaded architecture supports multiple threads of execution on each processor. These architectures also support low-cost thread initiation, low-overhead communication, and efficient communication and synchronization between threads on different processors. Multithreaded architectures are considered a promising medium for scalable parallelization of irregular applications, where the frequent communication and synchronization make parallelization hard on conventional parallel machines. We have developed an execution strategy for irregular reductions on a multithreaded architecture. The key idea in our execution model is that the frequency and volume of communication is independent of the contents of the indirection arrays. Thus, unlike other approaches to scalable parallelization of irregular reductions, our approach does not require mesh partitioning [1], array renumbering [6], or a high-cost inspector that itself requires communication between processors [7]. The performance depends upon the architecture's ability to support low-cost communication and overlap communication and computation, and is largely independent of the problem partitioning. Thus, the same performance can be obtained on adaptive problems, without paying the high overhead of partitioning frequently.
展开▼