Pre-Silicon Simulation of Multi-Core Benchmarks

Shubu Mukherjee
Principal Engineer
Director, SPEARS Group
Intel Corporation
Panel in Symposium on Workload Characterization, Sep 27, 2007

Detailed Model Good for Core Analysis
• Single-core simulation model executes ~12 milliseconds of a real machine's execution
• Assumes core model simulation speed = 1 KIPS (kilo simulated instructions per second)
• Assumes each simulation run is about 10 hours
[Figure labels: Socket, Core, Uncore]
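The ~12 millisecond figure can be checked with simple arithmetic. The sketch below takes the 1 KIPS simulation speed and 10-hour run from the slide, and assumes a real core that retires about 3 billion instructions per second (for example, 3 GHz at one instruction per cycle). That rate is an assumption for illustration, not a number from the slides, and the function name is hypothetical.

```python
def simulated_real_time_ms(sim_speed_ips, run_hours, real_ips):
    """Real-machine time (in ms) covered by one simulation run."""
    # Total instructions the simulator gets through in the run.
    instructions = sim_speed_ips * run_hours * 3600
    # Time the real machine would need to execute the same instructions.
    return instructions / real_ips * 1000.0

# From the slide: 1 KIPS simulation speed, 10-hour run.
# Assumption (not from the slides): the real core retires 3e9 instructions/s.
covered_ms = simulated_real_time_ms(sim_speed_ips=1_000, run_hours=10, real_ips=3e9)
print(f"{covered_ms:.1f} ms of real execution")  # 12.0 ms of real execution
```

Under the same assumed real-machine rate, a 10x faster model would cover about 120 ms in the same 10 hours.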

Four-Socket Platform Model Too Slow
• 1-socket simulation model executes ~1-3 milliseconds of a real machine's execution
• 4-socket simulation model executes only 100s of microseconds of a real machine's execution (recall that disk latency is in milliseconds)
Need at least a 10x boost in platform performance model speed

What Does a 10x Speed Improvement Give Us?
• Improved Accuracy
  • Via greater coverage of benchmark slices
  • Better glassjaw analysis
• Faster Turnaround
  • Improved latency
  • Faster debugging
• Improved Benchmarking
  • Greater coverage of benchmarks
  • Enables multithreaded (cooperative) benchmarks

Approaches to Boost Simulation Speed (one key charter for SPEARS)
• Improve basic infrastructure
• Create faster core models that are less accurate
• Go parallel in a modular fashion
• Use accelerators, such as FPGAs

What's Novel Here?
• Parallel simulation is an old technology
  • Distributed, discrete-event simulation, Fujimoto, 1990
  • Wisconsin Wind Tunnel I + II, Reinhardt, et al. 1992 & Mukherjee, et al. 1997
  • Customized for specific applications (e.g., shared memory)
So, What Are the Challenges?
• Starting point is several million lines of non-parallel C++ code (!)
• This is production software, so it must be stable (unlike "research" software)
• Parallel infrastructure must be modular: built once, used repeatedly, without changing any architecture model code
• Deal with new problems: load imbalance at multiple levels
Current Status: infrastructure created, work in progress

Speedup of the Pthread-per-socket Model (on Clovertowns)
• Speedup scales linearly with problem size
• A LOT more room for improvement exists
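The slides do not show how the thread-per-socket infrastructure is built. As a minimal sketch of the general idea, the example below runs one thread per simulated socket and synchronizes them with a barrier after each fixed quantum of simulated time, a common conservative scheme in parallel simulation. Everything here is an assumption for illustration: the function and parameter names are hypothetical, Python threads stand in for pthreads, and the "model" is a placeholder that only counts cycles. Because of Python's global interpreter lock, this shows the structure only and will not reproduce any speedup.

```python
import threading

def run_parallel_sim(num_sockets, num_quanta, cycles_per_quantum):
    """Advance every socket model one quantum at a time, one thread per socket."""
    barrier = threading.Barrier(num_sockets)
    cycles = [0] * num_sockets  # simulated cycles completed by each socket

    def socket_thread(sid):
        for _ in range(num_quanta):
            # Placeholder for the unmodified architecture model of this socket.
            cycles[sid] += cycles_per_quantum
            # No socket starts the next quantum until all have finished this
            # one, so the slowest socket sets the pace (load imbalance).
            barrier.wait()

    threads = [threading.Thread(target=socket_thread, args=(s,))
               for s in range(num_sockets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cycles

result = run_parallel_sim(num_sockets=4, num_quanta=3, cycles_per_quantum=100)
print(result)  # [300, 300, 300, 300]
```

Keeping the synchronization in the driver rather than in the per-socket model mirrors the stated goal of not changing architecture model code. The barrier also makes the load-imbalance concern concrete: any socket that takes longer per quantum stalls all the others.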
