In our previous work, we demonstrated the potential performance gains from update-based cache coherence protocols for a set of fine-grain scientific applications running on a scalable shared-memory multiprocessor. In this paper, we examine in detail the hardware-based write grouping scheme presented in that earlier work. First, we describe both software-based and hardware-based write grouping schemes. The software-based scheme, with its perfect knowledge of the application's write pattern, achieves optimal write grouping efficiency, but at the cost of added complexity in the application's code. Nevertheless, we use the software-based scheme to determine the optimal grouping efficiency for each application studied, and then demonstrate that the hardware-based scheme is almost as efficient while requiring little, if any, software modification.