70 likes | 92 Views
Explore strategies to enhance speed and accuracy of pre-silicon simulation for multi-core benchmarks. Learn about challenges, novel approaches, and ongoing efforts to achieve significant speedup. Current focus on Pthread-per-socket model shows promise for further enhancement.
E N D
Pre-Silicon Simulation of Multi-Core Benchmarks Shubu Mukherjee Principal Engineer Director, SPEARS Group Intel Corporation Panel in Symposium on Workload Characterization, Sep 27, 2007
Detailed Model Good for Core Analysis Socket • Single core simulation model executes ~ 12 milliseconds of a real machine’s execution • Assumes core speed = 1 KIPS (kilo simulated insts per second) • Assumes each simulation run is about 10 hours Core Uncore
Four-Socket Platform Model Too Slow • 1-socket simulation model executes ~ 1-3 milliseconds of a real machine’s execution • 4-socket simulation model executes only 100s of microseconds of a real machine’s execution (recall disk latency is in milliseconds) Need at least a 10x Boost in Platform Performance Model Speed
What 10x Speed Improvement Gives Us? • Improved Accuracy • Via greater coverage of benchmark slices • Better glassjaw analysis Faster Turnaround • Improved Latency • Faster debugging Improved Benchmarking • Greater coverage of benchmarks • Enables multithreaded (cooperative) benchmarks
Approaches to Boost Simulation Speed(one key charter for SPEARS) • Improve Basic Infrastructure • Create Faster Core Models That are Less Accurate • Go Parallel in a Modular Fashion • Use Accelerators, such as FPGAs
What’s Novel Here? • Parallel Simulation is an Old Technology • Distributed, discrete-event simulation, Fujimoto, 1990 • Wisconsin Wind Tunnel I + II, Reinhardt, et al 1992 & Mukherjee, et al. 1997 • Customized for specific applications (e.g., shared memory) So, What Are the Challenges? • Starting point is several millions of lines of non-parallel C++ code (!) • This is production software must be stable (unlike “research” software) • Parallel infrastructure must be modular, built once, used repeatedly without changing any architecture model code • Deal with new problems: load imbalance at multiple levels Current Status: Created infrastructure, Work-In-Progress
Speedup of the Pthread-per-socket Model(on Clovertowns) • Speedup scales linearly with problem size • LOT more room for improvement exists