Statistical Simulation of Superscalar Architectures using Commercial Workloads

Lieven Eeckhout and Koen De Bosschere, Dept. of Electronics and Information Systems (ELIS), Ghent University, Belgium. CAECW’01, January 21, 2001

Outline • Introduction • Statistical Simulation • Statistical profiling • Synthetic trace generation • Methodology • Evaluation • Conclusion

Introduction • Architectural simulation • trace-driven or execution-driven • accurate • long simulation times • long traces to be stored • Need for fast simulation techniques • take part of a full trace • analytical modeling • trace sampling • statistical simulation

Goal • Previous work used SPEC benchmarks to evaluate statistical simulation • In this talk we use both commercial and scientific workloads • SPECint, SPECfp, system traces, multimedia, X graphics, database

Statistical Simulation • Three steps: • extract statistical profile from a program execution • generate synthetic trace from it • simulate on a trace-driven simulator • Two major advantages: • statistical profile is more compact than full trace • fast simulation due to statistical nature • design space exploration in limited time

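Statistical Simulation [Figure: flow diagram. A real trace (e.g. SPEC benchmark) goes through branch profiling, cache profiling, and instruction profiling, which yield branch statistics, cache statistics, and instruction statistics. Together these form the statistical profile, which feeds the synthetic trace generator; the resulting synthetic trace is run on the trace-driven simulator.]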

Statistical Profiling • Microarchitecture-independent statistics • instruction statistics • Microarchitecture-dependent statistics • branch statistics • cache statistics • Result: statistical simulation only to explore design options of processor core (cache and branch predictor are fixed)

Statistical Profiling: Instruction Statistics • Instruction mix (13 classes) • Number of register operands • Age of register operands • probability that a register operand was produced a given number of instructions before it in the trace (RAW dependencies only) • Memory dependencies • probability that a load is memory-dependent on a store a given number of stores before it in the trace (RAW dependencies only)
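As an illustration of the instruction statistics above, here is a minimal sketch of how an instruction mix and a register-operand age distribution could be collected from a trace. The trace format (tuples of instruction class, destination register, source registers) and the name `profile_trace` are assumptions made for this example, not the profiler used in the talk; memory dependencies are left out for brevity.

```python
from collections import Counter

def profile_trace(trace):
    """Collect an instruction mix and a RAW register-age distribution.

    trace: list of (op_class, dest_reg_or_None, [src_regs]) tuples
           (a hypothetical format chosen for this sketch).
    """
    mix = Counter()      # instruction class -> count
    age = Counter()      # distance to producing instruction -> count
    last_writer = {}     # register -> index of the last instruction writing it

    for i, (op, dest, srcs) in enumerate(trace):
        mix[op] += 1
        for reg in srcs:
            if reg in last_writer:           # RAW dependency inside the trace
                age[i - last_writer[reg]] += 1
        if dest is not None:
            last_writer[dest] = i

    n = len(trace)
    total_ages = sum(age.values())
    mix_p = {op: count / n for op, count in mix.items()}
    age_p = {d: count / total_ages for d, count in age.items()} if total_ages else {}
    return mix_p, age_p

# Toy trace, invented for illustration only.
toy_trace = [
    ("ld", "r1", ["r9"]),
    ("add", "r2", ["r1", "r1"]),
    ("st", None, ["r2", "r9"]),
    ("br", None, ["r2"]),
]
mix_p, age_p = profile_trace(toy_trace)
```

The two dictionaries are normalized to probabilities, which is the form a synthetic trace generator would sample from.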

Statistical Profiling: Branch Statistics • Six branch types • conditional branch, unconditional branch, call with offset, indirect jump, indirect call, return • Distinction • branch prediction accuracy: refill pipeline on branch misprediction • branch target prediction accuracy: single-cycle bubble in pipeline on correct branch prediction but target misprediction
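The distinction between the two accuracies can be sketched as a small penalty model. This is only an illustration under stated assumptions: the function name `branch_penalty` and the `refill_cycles` parameter are hypothetical, and the actual refill cost depends on the simulated pipeline.

```python
import random

def branch_penalty(rng, p_dir_correct, p_target_correct, refill_cycles):
    """Return the pipeline penalty (in cycles) for one synthetic branch.

    p_dir_correct:    branch prediction accuracy (from the branch statistics)
    p_target_correct: branch target prediction accuracy
    refill_cycles:    assumed pipeline refill cost, a placeholder value
    """
    if rng.random() >= p_dir_correct:
        return refill_cycles   # misprediction: refill the pipeline
    if rng.random() >= p_target_correct:
        return 1               # right direction, wrong target: one-cycle bubble
    return 0                   # fully correct prediction: no penalty

rng = random.Random(1)
penalties = [branch_penalty(rng, 0.92, 0.98, refill_cycles=8) for _ in range(10_000)]
```

Over many draws, the fraction of non-zero penalties approaches the misprediction rates encoded in the profile.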

Statistical Profiling: Cache Statistics • D-cache statistics • L1 D-cache miss rate • L2 D-cache miss rate • I-cache statistics • L1 I-cache miss rate • L2 I-cache miss rate

Synthetic Trace Generation • Instruction-by-instruction • through random number generation • Determine • instruction type • number of operands • age of register operands • memory dependency • branch behavior • D-cache behavior • I-cache behavior • [Figure: example synthetic instruction sequence (st, add, ld, br) with instructions labeled I-cache miss, D-cache miss, and mispredicted]
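The instruction-by-instruction generation described above can be sketched as repeated sampling from the statistical profile. The profile layout, the key names, and the numbers in `toy_profile` are all assumptions invented for this example; memory dependencies and L2 behavior are omitted to keep it short.

```python
import random

def sample(rng, dist):
    """Draw one key from a {value: probability} dictionary."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

def generate_trace(profile, length, seed=0):
    """Generate `length` synthetic instructions from a statistical profile."""
    rng = random.Random(seed)
    trace = []
    for _ in range(length):
        op = sample(rng, profile["mix"])                 # instruction type
        n_src = sample(rng, profile["n_operands"][op])   # number of operands
        instr = {
            "op": op,
            # age of each register operand (distance to its producer)
            "src_ages": [sample(rng, profile["age"]) for _ in range(n_src)],
            "icache_miss": rng.random() < profile["l1i_miss_rate"],
        }
        if op == "ld":
            instr["dcache_miss"] = rng.random() < profile["l1d_miss_rate"]
        if op == "br":
            instr["mispredicted"] = rng.random() >= profile["branch_accuracy"]
        trace.append(instr)
    return trace

# Hypothetical profile; the values are placeholders, not measured data.
toy_profile = {
    "mix": {"ld": 0.25, "st": 0.10, "add": 0.50, "br": 0.15},
    "n_operands": {"ld": {1: 1.0}, "st": {2: 1.0},
                   "add": {1: 0.3, 2: 0.7}, "br": {1: 1.0}},
    "age": {1: 0.5, 2: 0.3, 4: 0.2},
    "l1i_miss_rate": 0.02,
    "l1d_miss_rate": 0.05,
    "branch_accuracy": 0.92,
}
synthetic = generate_trace(toy_profile, 1000, seed=42)
```

Fixing the seed makes a run reproducible, while changing it yields a different synthetic trace with the same statistical properties.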

Methodology: microarchitecture • Out-of-order processor • 8 and 16 issue • windows of 64 and 128 instructions • McFarling branch predictor • ‘small’ cache configuration • 8KB DM L1 I-cache, 8KB DM L1 D-cache, 64KB 2WSA unified L2 cache • ‘large’ cache configuration • 32KB DM L1 I-cache, 64KB 2WSA L1 D-cache, 512KB 4WSA unified L2 cache • Access time • L1 I-cache (1 cycle), L1 D-cache (2 cycles), L2 cache (10 cycles), main memory (80 cycles)

Methodology: benchmarks • 8 SPECint95 benchmarks • 5 SPECfp95 benchmarks (hydro2d, su2cor, swim, tomcatv, wave5) • 8 IBS system traces (mpeg, jpeg, gs, verilog, gcc, sdet, nroff, groff) • 4 MediaBench applications (g721, gs, gsm, mpeg2) • 4 X graphics benchmarks (DooM, POVRay, Xanim, Quake) • 2 TPC-D queries running on Postgres 6.3 • ~ 200 million instructions / trace

Evaluation • $\text{IPC prediction error} = \dfrac{\text{IPC}_{\text{real trace}} - \text{IPC}_{\text{synthetic trace}}}{\text{IPC}_{\text{real trace}}}$ • IPC real trace = IPC when running the real trace on the trace-driven simulator • IPC synthetic trace = IPC when running the synthetic trace generated from the statistical profile of the real trace • Simulation speed: $\sigma_{\text{IPC}} / \bar{x}_{\text{IPC}}$ (standard deviation of the IPC over its mean) less than 1% after simulating 1 million instructions
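The two evaluation quantities can be written down directly. The sketch below follows the sign convention of the slide (real minus synthetic, divided by real) and reads the simulation-speed criterion as the standard deviation of the IPC divided by its mean, taken over several synthetic runs; that reading and the function names are assumptions of this example.

```python
import statistics

def ipc_prediction_error(ipc_real, ipc_synthetic):
    """Relative error of the synthetic-trace IPC with respect to the real-trace IPC."""
    return (ipc_real - ipc_synthetic) / ipc_real

def ipc_variation(ipc_samples):
    """Sample standard deviation of the IPC divided by its mean."""
    return statistics.stdev(ipc_samples) / statistics.mean(ipc_samples)

# Placeholder numbers, not results from the talk.
error = ipc_prediction_error(ipc_real=2.0, ipc_synthetic=1.8)
variation = ipc_variation([1.79, 1.80, 1.81, 1.80])
converged = variation < 0.01   # the "less than 1%" criterion
```

With these placeholder inputs the synthetic trace underestimates the IPC by about 10%, and the run-to-run variation is below the 1% threshold.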

IPC prediction error (1) [Figure: bar chart of IPC prediction error per benchmark, grouped by suite (SPECint95, SPECfp95, IBS, MediaBench, X graphics, TPC-D); y-axis from -30% to 40%; off-scale values of 157% and 135% carry the annotation "high D-cache miss rate". Configuration: 16-issue, 128-entry window, ‘small’ cache configuration.]

IPC prediction error (2) [Figure: bar chart of IPC prediction error per benchmark, grouped by suite (SPECint95, SPECfp95, IBS, MediaBench, X graphics, TPC-D); y-axis from -30% to 30%. Configuration: 16-issue, 128-entry window, ‘large’ cache configuration.]

IPC prediction error vs. static instruction count [Figure: scatter plot. x-axis: static instruction count (number of instructions executed at least once), 0 to 160000; y-axis: IPC prediction error, -40% to 160%. Four series: w = 64, i = 8, 'small' cache; w = 128, i = 16, 'small' cache; w = 64, i = 8, 'large' cache; w = 128, i = 16, 'large' cache. Labeled points: nroff, jpeg (IBS), verilog, sdet, mpeg (IBS), groff, gcc, DooM, Quake, gs (IBS), gcc (IBS), vortex, go, TPC-D.]

Conclusion (1) • Higher IPC prediction errors for applications with smaller static instruction count: • MediaBench applications • SPECfp95 benchmarks • 2 X graphics benchmarks (POVRay and Xanim) • 5 SPECint95 benchmarks

Conclusion (2) • Smaller IPC prediction errors for applications with larger instruction footprint: • IBS system traces • TPC-D traces • 2 X graphics benchmarks (DooM and Quake) • 3 SPECint95 benchmarks (go, gcc, vortex) • IPC prediction error between -1% and 25%

Conclusion (3) • Statistical simulation is a useful fast simulation technique for commercial workloads • due to higher variability in instructions • since commercial workloads have larger instruction footprint • which makes a statistical technique more powerful
