We show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single simultaneous vector multithreaded architecture to execute regular vectorizable code at a performance level that cannot be achieved using either paradigm on its own. The combination of the two techniques yields very high performance at low cost and low complexity. The architecture achieves a sustained performance on regular numerical codes that is 20 times the performance that can be achieved with today's superscalar microprocessors. Moreover, the architecture can tolerate very large memory latencies, of up to 100 cycles, with a relatively small performance degradation. This high performance is independent of working set size or of locality considerations, since the DLP paradigm allows very efficient exploitation of a high-performance flat memory bandwidth.