Wavescalar • S. Swanson, et al. • Computer Science and Engineering, University of Washington • Presented by Brett Meyer
ILP in Modern Architecture • Lots of available ILP in software • Execute in parallel for greater performance • Superscalar processors can’t tap it • Serialized by PC • Superscalar doesn’t scale • Data-flow approaches can cheaply leverage existing parallelism
Wavescalar • Introduction • WaveCache and Wavescalar ISA • Evaluation and Results • Does WaveCache make sense? • Compiler challenges
Wavescalar: Basics • ALU-in-cache data-flow architecture • No centralized, broadcast-based resources • Compile data-flow binaries
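The data-flow idea on this slide can be illustrated with a minimal sketch: an instruction fires as soon as all of its operands are available, with no program counter imposing an order. The graph, the node names (`sum`, `diff`, `prod`) and the `run_dataflow` helper are hypothetical and chosen only for illustration; this is not the WaveCache hardware algorithm.

```python
from collections import deque

# Hypothetical toy graph for (a + b) * (a - b).
# Each node maps to (operation, list of operand names).
GRAPH = {
    "sum":  (lambda x, y: x + y, ["a", "b"]),
    "diff": (lambda x, y: x - y, ["a", "b"]),
    "prod": (lambda x, y: x * y, ["sum", "diff"]),
}

def run_dataflow(graph, inputs):
    """Fire each node once all of its operands have been produced."""
    values = dict(inputs)      # operand name -> value produced so far
    fired = []                 # order in which nodes actually fired
    pending = deque(graph)     # nodes that have not fired yet
    while pending:
        progressed = False
        for _ in range(len(pending)):
            name = pending.popleft()
            op, srcs = graph[name]
            if all(s in values for s in srcs):
                values[name] = op(*(values[s] for s in srcs))
                fired.append(name)
                progressed = True
            else:
                pending.append(name)   # operands missing, retry later
        if not progressed:
            raise ValueError("some operands can never be produced")
    return values, fired

values, fired = run_dataflow(GRAPH, {"a": 5, "b": 3})
```

Here `sum` and `diff` depend only on the inputs, so either could fire first; only `prod` has to wait for both.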
Wavescalar: Waves • Instruction set architecture • Programs broken into waves • Block with single entry • Use wave number to tag data • Disambiguates data from multiple iterations
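A rough sketch of how wave-number tags can keep operands from different loop iterations apart: operands are matched on the pair (instruction, wave number), so a value from a later iteration never pairs with one from an earlier iteration. The `MatchingTable` class and its `arrive` method are assumptions made for this example, not structures named on the slides.

```python
from collections import defaultdict

class MatchingTable:
    """Pairs operands by (instruction, wave number) before firing."""

    def __init__(self, arity=2):
        self.arity = arity
        self.slots = defaultdict(dict)   # (inst, wave) -> {port: value}

    def arrive(self, inst, wave, port, value):
        """Record one operand; return the operand tuple once all ports are full."""
        key = (inst, wave)
        self.slots[key][port] = value
        if len(self.slots[key]) == self.arity:
            operands = self.slots.pop(key)
            return tuple(operands[p] for p in range(self.arity))
        return None                      # still waiting for the other operand

table = MatchingTable()
first = table.arrive("add", 0, 0, 10)    # wave 0, left operand
second = table.arrive("add", 1, 0, 20)   # wave 1 arrives early
third = table.arrive("add", 1, 1, 2)     # wave 1 is complete
fourth = table.arrive("add", 0, 1, 1)    # wave 0 is complete
```

Without the wave number in the key, the left operand of wave 0 would have been paired with the right operand of wave 1.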
Wavescalar: Memory • Relaxed program order • Follow control-flow • Obey dependencies • Distributed store buffers • Hardware coherence
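The slide does not give the ordering mechanism, so the following is only a sketch under a stated assumption: each memory operation carries its own sequence number plus the sequence number of its predecessor along the executed control-flow path, and a store buffer issues an operation only after its predecessor has issued. The `StoreBuffer` class, the `(pred, seq)` encoding and the tuple format for operations are all hypothetical.

```python
class StoreBuffer:
    """Issues memory ops in chain order even when they arrive out of order."""

    def __init__(self, start=0):
        self.last = start     # sequence number of the last issued op
        self.waiting = {}     # predecessor seq -> (seq, op)
        self.memory = {}      # address -> value
        self.log = []         # (seq, value loaded or None for a store)

    def submit(self, pred, seq, op):
        """op is ("store", addr, value) or ("load", addr)."""
        self.waiting[pred] = (seq, op)
        # Drain every op whose predecessor has already issued.
        while self.last in self.waiting:
            next_seq, next_op = self.waiting.pop(self.last)
            if next_op[0] == "store":
                self.memory[next_op[1]] = next_op[2]
                self.log.append((next_seq, None))
            else:
                self.log.append((next_seq, self.memory.get(next_op[1])))
            self.last = next_seq

buf = StoreBuffer()
buf.submit(1, 3, ("load", "x"))       # arrives first, but must wait for op 1
buf.submit(0, 1, ("store", "x", 7))   # unblocks the chain 0 -> 1 -> 3
```

Sequence number 2 never appears in this run; in this toy encoding that stands for an operation on a branch path that was not taken, which is why the chain links predecessors explicitly instead of counting.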
Evaluation • WaveCache • 4 MB of on-chip instructions + data, 2K ALUs • WaveCache vs. superscalar • 16-wide OOO, 1K registers, 1K window • WaveCache vs. TRIPS • 4 16-wide in-order cores, 2 MB on-chip cache • Key assumption: perfect memory • Fair comparisons? • Is it reasonable to assume perfect memory?
Results • WaveCache out-performs superscalar • Similar performance to TRIPS
Memory is the problem, not ILP • Data-flow exposes greater ILP • Memory not fast enough for low-ILP CPUs • Processor-memory performance gap • What does perfect memory hide? • Does superscalar perform better? • Did not model hardware coherence • WaveCache needs MORE bandwidth than a superscalar
Is WaveScalar Scalable? • Sub-linear performance improvement • More clusters further away from memory • SPEC, MediaBench fit easily in memory • What happens to performance when the working set doesn’t fit in WaveCache?
Compiler Challenges • Wave identification • Can waves be optimized for performance? • Handling path explosion • With 1 branch per 5 instructions, 1050 instructions loaded for 100 executed?
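The slide does not say how the 1050-loaded-for-100-executed figure is derived. One reading that happens to reproduce it, offered purely as an assumption, is that the executed path crosses 20 five-instruction blocks and that level k of the branch structure requires k blocks to be loaded. The function names below are hypothetical; the second function shows the much larger count a full binary branch tree would give, for contrast.

```python
def loaded_linear(executed, insts_per_branch):
    """ASSUMPTION: level k of the branch structure loads k blocks."""
    levels = executed // insts_per_branch
    return sum(level * insts_per_branch for level in range(1, levels + 1))

def loaded_full_tree(executed, insts_per_branch):
    """Contrast case: every branch doubles the number of blocks per level."""
    levels = executed // insts_per_branch
    return insts_per_branch * (2 ** levels - 1)

linear = loaded_linear(100, 5)        # 5 * (1 + 2 + ... + 20)
full = loaded_full_tree(100, 5)       # 5 * (2**20 - 1)
```

Under the linear reading the loaded-to-executed ratio is about 10:1; under the full-tree reading it would be tens of thousands to one, which suggests the slide's figure assumes paths reconverge.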
Compiler Challenges • Semi-static instruction placement • Fetch partial/complete waves • Loads/stores close to memory • Clustering neighboring instructions • Reduce coherence traffic