WaveScalar S. Swanson, et al. Computer Science and Engineering University of Washington

WaveScalar S. Swanson, et al. Computer Science and Engineering University of Washington Presented by Brett Meyer

ILP in Modern Architecture
• Lots of available ILP in software
• Execute in parallel for greater performance
• Superscalar processors can’t tap it
• Serialized by PC
• Superscalar doesn’t scale
Data-flow approaches can cheaply leverage existing parallelism

WaveScalar
• Introduction
• WaveCache and WaveScalar ISA
• Evaluation and Results
• Does WaveCache make sense?
• Compiler challenges

WaveScalar: Basics
• ALU-in-cache data-flow architecture
• No centralized, broadcast-based resources
• Compile data-flow binaries
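
A data-flow machine has no program counter: an instruction fires as soon as all of its operands have arrived. The sketch below illustrates that firing rule for a hypothetical three-instruction graph computing (a + b) * (a - b). The graph layout, the `run` function, and the token format are assumptions made for illustration; they are not the WaveScalar ISA or the WaveCache hardware.

```python
import operator
from collections import deque

# Hypothetical data-flow graph for (a + b) * (a - b).
# Each node names its operation, its input count, and the (node, port) pairs it feeds.
GRAPH = {
    "add": {"op": operator.add, "inputs": 2, "consumers": [("mul", 0)]},
    "sub": {"op": operator.sub, "inputs": 2, "consumers": [("mul", 1)]},
    "mul": {"op": operator.mul, "inputs": 2, "consumers": []},
}


def run(graph, initial_tokens):
    """Fire each node when all of its operands are present; no program counter is used."""
    operands = {name: {} for name in graph}   # node -> {port: value}
    tokens = deque(initial_tokens)            # each token is (node, port, value)
    results = {}
    firing_order = []
    while tokens:
        node, port, value = tokens.popleft()
        operands[node][port] = value
        spec = graph[node]
        if len(operands[node]) == spec["inputs"]:
            args = [operands[node][p] for p in range(spec["inputs"])]
            result = spec["op"](*args)
            operands[node] = {}               # operands are consumed by firing
            results[node] = result
            firing_order.append(node)
            for consumer, consumer_port in spec["consumers"]:
                tokens.append((consumer, consumer_port, result))
    return results, firing_order


# Operands arrive in an arbitrary order; execution order follows data availability.
results, firing_order = run(
    GRAPH, [("sub", 0, 7), ("add", 0, 7), ("add", 1, 3), ("sub", 1, 3)]
)
# results == {"add": 10, "sub": 4, "mul": 40}
```

Here `add` and `sub` are independent, so either could fire first; only `mul` is forced to wait, and it waits on data rather than on its position in an instruction stream.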

WaveScalar: Waves
• Instructions  architecture
• Programs broken into waves
• Block with single entry
• Use wave number to tag data
• Disambiguates data from multiple iterations
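
To see why tagging data with a wave number disambiguates loop iterations, consider operands from two iterations of the same instruction arriving interleaved. The sketch below keeps operands apart by keying them on (instruction, wave number). The `MatchingTable` class and its `deliver` method are hypothetical names for illustration, not a description of the actual WaveCache matching hardware.

```python
from collections import defaultdict


class MatchingTable:
    """Holds operands until every input of one instruction instance has arrived."""

    def __init__(self, arity):
        self.arity = arity                  # instruction id -> number of inputs
        self.pending = defaultdict(dict)    # (instruction id, wave number) -> {port: value}

    def deliver(self, instr, wave, port, value):
        """Record one operand; return the operand list once this instance can fire."""
        key = (instr, wave)
        self.pending[key][port] = value
        if len(self.pending[key]) == self.arity[instr]:
            operands = self.pending.pop(key)
            return [operands[p] for p in sorted(operands)]
        return None


table = MatchingTable({"add": 2})

# Operands for waves 0 and 1 (two loop iterations) arrive interleaved.
events = [("add", 0, 0, 10), ("add", 1, 0, 20), ("add", 1, 1, 2), ("add", 0, 1, 1)]

fired = {}
for instr, wave, port, value in events:
    operands = table.deliver(instr, wave, port, value)
    if operands is not None:
        fired[wave] = sum(operands)
# fired == {1: 22, 0: 11}: wave 1 fires first, and no operands are mixed across waves
```

Without the wave number in the key, the first operand of wave 1 would overwrite or pair with the first operand of wave 0.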

WaveScalar: Memory
• Relaxed program order
• Follow control-flow
• Obey dependencies
• Distributed store buffers
• Hardware coherence
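
The ordering job of a store buffer can be sketched as follows: memory operations may reach the buffer in any order, but they are applied to memory in program order. This is a deliberately simplified model. It assumes each operation carries a sequence number assigned in program order, and it ignores branches within a wave, multiple waves, and coherence between distributed buffers. The `StoreBuffer` class is a hypothetical name, not the slide's design.

```python
class StoreBuffer:
    """Applies memory operations in sequence-number order, whatever their arrival order."""

    def __init__(self):
        self.memory = {}
        self.waiting = {}     # sequence number -> (kind, address, value)
        self.next_seq = 0     # next operation allowed to touch memory
        self.loads = []       # (sequence number, value returned) for each completed load

    def arrive(self, seq, kind, addr, value=None):
        """Buffer one operation, then issue every operation that is now in order."""
        self.waiting[seq] = (kind, addr, value)
        while self.next_seq in self.waiting:
            kind, addr, value = self.waiting.pop(self.next_seq)
            if kind == "store":
                self.memory[addr] = value
            else:
                self.loads.append((self.next_seq, self.memory.get(addr)))
            self.next_seq += 1


buf = StoreBuffer()
# Program order: 0: store A=1, 1: load A, 2: store A=2. Arrival order is reversed.
buf.arrive(2, "store", "A", 2)
buf.arrive(1, "load", "A")
buf.arrive(0, "store", "A", 1)
# buf.loads == [(1, 1)] and buf.memory == {"A": 2}
```

The load sees the value written by the store that precedes it in program order, even though the later store reached the buffer first.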

Evaluation
• WaveCache
• 4 MB of on-chip instructions + data, 2K ALUs
• WaveCache vs. superscalar
• 16-wide OOO, 1K registers, 1K window
• WaveCache vs. TRIPS
• 4 16-wide in-order cores, 2 MB on-chip cache
• Key assumption: perfect memory
Fair comparisons? Is it reasonable to assume perfect memory?

Results
• WaveCache outperforms superscalar
• Similar performance to TRIPS

Memory is the problem, not ILP
• Data-flow exposes greater ILP
• Memory not fast enough for low-ILP CPUs
• Processor-memory performance gap
• What does perfect memory hide?
• Does superscalar perform better?
• Did not model hardware coherence
WaveCache needs MORE bandwidth than a superscalar

Is WaveScalar Scalable?
• Sub-linear performance improvement
• More clusters further away from memory
• SPEC, MediaBench fit easily in memory
• What happens to performance when the working set doesn’t fit in WaveCache?

Compiler Challenges
• Wave identification
• Can waves be optimized for performance?
• Handling path explosion
• 1 BR/5 inst → 1050 loaded for 100 executed?
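
The slide does not say how the 1050 figure was derived. One accounting that happens to reproduce it is sketched below, and it is an assumption, not the presenter's stated model: one branch every 5 instructions splits a 100-instruction path into 20 blocks, and the number of 5-instruction blocks loaded grows by one at each branch level (1 block at the first level, 2 at the second, up to 20 at the last).

```python
def loaded_instructions(executed, instructions_per_branch):
    """Instructions loaded if level k of the path requires loading k blocks (assumed model)."""
    levels = executed // instructions_per_branch   # 100 // 5 = 20 blocks on the executed path
    blocks_loaded = sum(range(1, levels + 1))      # 1 + 2 + ... + 20 = 210 blocks
    return blocks_loaded * instructions_per_branch


total = loaded_instructions(executed=100, instructions_per_branch=5)
# total == 1050, i.e. 10.5 instructions loaded per instruction executed
```

Under this reading, the loaded-to-executed ratio grows with path length, which is the concern behind the path-explosion bullet.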

Compiler Challenges
• Semi-static instruction placement
• Fetch partial/complete waves
• Loads/stores close to memory
• Clustering neighboring instructions
• Reduce coherence traffic
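
The "clustering neighboring instructions" idea can be illustrated with a small placement heuristic: put each instruction in the same cluster as one of its producers when there is room, so that fewer operands cross cluster boundaries. The functions `place_greedy`, `place_sequential`, and `remote_edges` are hypothetical and greatly simplified; the sketch does not model memory proximity, wave fetching, or coherence traffic.

```python
def place_greedy(order, producers, num_clusters, capacity):
    """Place each instruction with one of its producers if possible, else in the emptiest cluster."""
    load = [0] * num_clusters
    placement = {}
    for instr in order:  # `order` must list producers before their consumers
        near = [placement[p] for p in producers.get(instr, ()) if load[placement[p]] < capacity]
        cluster = near[0] if near else min(range(num_clusters), key=load.__getitem__)
        if load[cluster] >= capacity:
            raise ValueError("not enough cluster capacity")
        placement[instr] = cluster
        load[cluster] += 1
    return placement


def place_sequential(order, capacity):
    """Baseline: fill clusters in fetch order, ignoring the data-flow graph."""
    return {instr: i // capacity for i, instr in enumerate(order)}


def remote_edges(placement, producers):
    """Count producer-to-consumer edges that cross a cluster boundary."""
    return sum(placement[p] != placement[c] for c, ps in producers.items() for p in ps)


# Two independent dependence chains, interleaved in fetch order.
order = ["a0", "b0", "a1", "b1", "a2", "b2"]
producers = {"a1": ["a0"], "a2": ["a1"], "b1": ["b0"], "b2": ["b1"]}

greedy_cost = remote_edges(place_greedy(order, producers, num_clusters=2, capacity=3), producers)
sequential_cost = remote_edges(place_sequential(order, capacity=3), producers)
# greedy_cost == 0, sequential_cost == 2
```

In this toy case the greedy placement keeps each chain inside one cluster, while filling clusters in fetch order splits both chains.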
