
The Mark-II Performance Simulator for VIRAM-1
Gagan Prakash, Brian Gaeke
CS 252 Spring 2001
http://www-inst.eecs.berkeley.edu/~brg/vsimII
brg@eecs.berkeley.edu
gagpcool@hkn.eecs.berkeley.edu

Problems in Performance Simulation
(Simulation) Runtime Does Matter!
  Current performance simulator => 1500X slowdown
Many problems now assume "normal" VIRAM-1 chip config
  When simulator was designed, "normal" chip not known
  Lots of parameters no longer needed
Many computation-intensive datasets cannot be simulated
Software architecture of current simulator non-portable
Current simulator no longer maintained
  Out of date with respect to simpler (functional) simulator
  Can't use today's machines (180 MHz fastest simulation machine)

Solutions for Simulation
Trace-based cycle-level simulator
  Traces from actively-maintained functional simulator
  No more version skew!
Emphasis on portability
  Faster simulation machines => faster simulations
Streamline parametrization
  Make it look like the "normal" VIRAM-1 chip
  Restrict parametrization
Potential pitfall: traces are huge
  Support compressed traces

High-Level Simulator Architecture
Design of simulator mirrors design of VIRAM-1 chip
Software Units
  Lexer & Parser
  Performance Analyzer
  Control Unit and queue
  Issue Unit and queue
  Functional Units
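The unit structure above can be illustrated with a toy cycle loop. This is only a minimal sketch under stated assumptions, not the Mark-II code: the `Instr` record, the `Pipeline` class, the one-instruction-per-cycle hand-off between queues, and the fixed per-instruction latency are all hypothetical, chosen to show how a control queue, an issue queue, and a set of functional units might be chained.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical trace record; the real trace format is not described in the slides.
struct Instr {
    std::string opcode;
    int latency;  // assumed number of cycles the instruction occupies a functional unit
};

// Toy model: control queue -> issue queue -> functional units.
class Pipeline {
public:
    explicit Pipeline(std::size_t num_fus) : fu_busy_(num_fus, 0) {
        assert(num_fus > 0);  // with no functional units nothing could ever issue
    }

    // Stand-in for the lexer/parser handing a decoded instruction to the control unit.
    void fetch(const Instr& i) { control_q_.push_back(i); }

    // Advance one simulated cycle.
    void tick() {
        ++cycle_;
        // Functional units: count down remaining busy cycles.
        for (int& busy : fu_busy_) {
            if (busy > 0) --busy;
        }
        // Issue unit: send the oldest queued instruction to the first free unit.
        if (!issue_q_.empty()) {
            for (int& busy : fu_busy_) {
                if (busy == 0) {
                    busy = issue_q_.front().latency;
                    issue_q_.pop_front();
                    break;
                }
            }
        }
        // Control unit: hand one instruction per cycle to the issue queue.
        if (!control_q_.empty()) {
            issue_q_.push_back(control_q_.front());
            control_q_.pop_front();
        }
    }

    bool idle() const {
        if (!control_q_.empty() || !issue_q_.empty()) return false;
        for (int busy : fu_busy_) {
            if (busy > 0) return false;
        }
        return true;
    }

    long cycle() const { return cycle_; }

private:
    std::deque<Instr> control_q_;
    std::deque<Instr> issue_q_;
    std::vector<int> fu_busy_;  // remaining busy cycles per functional unit
    long cycle_ = 0;
};

// Stand-in for the performance analyzer: run a whole trace and report the cycle count.
long simulate(const std::vector<Instr>& trace, std::size_t num_fus) {
    Pipeline p(num_fus);
    for (const Instr& i : trace) p.fetch(i);
    while (!p.idle()) p.tick();
    return p.cycle();
}
```

Calling `simulate(trace, 2)` on a vector of `Instr` records returns the number of simulated cycles until both queues drain and every unit is free.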

Low-Level Simulator Architecture
Functional Units (FUs)
  Memory Functional Unit
    Memory system
    Translation system
  Flag Functional Unit
  Arithmetic Functional Units: 1 Int+FP, 1 Int only
Element group queues

Wall clock time to simulate

Peak simulator memory usage
Measured resident size using ps (pages actually touched)

Predicted cycle count
Measuring inner loops only
Percent Difference:
  Update: 13%
  Transitive: 5%
  Pointer: 17%

Project Successes
Useful parametrizations!
  Lanes, banks/subbanks, memory size
Reduced simulator memory size
Lots of simple optimizations
  Don't simulate empty queues
  Retire no-ops early
Reduced implementation complexity
  7,500 LOC vs 117,000 LOC
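The two simple optimizations named above (skip empty queues, retire no-ops early) might look roughly like the following. This is a hedged sketch, not the simulator's actual code: the `Op` enum, the `Unit` struct, and the `enqueue`/`tick` helpers are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

enum class Op { Nop, VAdd, VLoad };  // hypothetical opcodes

struct Unit {
    std::deque<Op> queue;
    long work_done = 0;

    // Stand-in for the real per-cycle work of a unit.
    void step() {
        assert(!queue.empty());
        queue.pop_front();
        ++work_done;
    }
};

// Retire no-ops at enqueue time so they never occupy a queue slot.
// Returns true if the operation was actually queued.
bool enqueue(Unit& u, Op op) {
    if (op == Op::Nop) return false;  // retired early
    u.queue.push_back(op);
    return true;
}

// One cycle: step only the units that have work.
// Returns how many units were stepped.
std::size_t tick(std::vector<Unit>& units) {
    std::size_t stepped = 0;
    for (Unit& u : units) {
        if (u.queue.empty()) continue;  // don't simulate empty queues
        u.step();
        ++stepped;
    }
    return stepped;
}
```

Both checks are cheap tests on the hot path; the payoff depends on how often queues are empty and how many no-ops the trace contains, which the slides do not quantify.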

Project "Not-So-Successes"
Cycle-level simulation
  Memory FU resolves hazards per-element-group
  Element groups from many instructions in any cycle
  Interlocks between memory unit and other FUs
  Control/issue unit simulations basically trivial
Trace size
  Small traces range from 50-250 MB
  Simulator spends 70-95% of time in I/O
Memory system: Implementor information starvation!
  Memory bandwidth numbers are unavailable
  TLB undocumented
  Scalar core????

Conclusions and Future Work
Program dependent average analysis
  Multiple idealized models
    Each with a queue model and a few typical kernels
  Could enable multicycle simulation
  You need a general simulator to enable this, though
Cut the fat out of the old simulator
  Port it to other platforms?
Exception modeling untouched
  "We still don't have an OS"
  Software-managed TLB effects unknown
Is this simulation really better? (Hennessy)

What We Learned
Leverage Existing Work First!
  Why rewrite when you can port, extend, or document…?
Need extremely detailed docs to write simulator
  A good simulator can be documentation…
  Need access to random notes, not just theses
  Emphasize leaving behind good docs when you graduate?
Devising good approximations for complex HW is a black art
  But… approximations are indispensable
  Trading off accuracy vs. complexity
Experiment with compilers and standard libraries
  Portability and efficiency

Backup Slides

Why We Ditched Multicycle Simulation
1. Finding register file structural hazards requires per-cycle simulation
  Suppose full pipeline...
  Every cycle, some FU is doing a reg read
  Could cause structural hazard w/ first memory unit stage
2. Memory unit must be synched with other FUs
  Memory unit controls other units' stalls
  To figure out whether other units can go ahead…
  Need all the details of memory unit state per pipeline stage
3. Added overhead of multicycle => Amdahl's Law
  Simplify implementation by always assuming single cycle

Compiler Effects
Fallacy: The compiler that understands the language better produces the faster code.
Stepanov Abstraction Penalty Benchmark
  Measures speedup of C++ library algos/data abstractions versus naive (FORTRAN-like) hand coded loops
  On same floating point vector kernel
You pay 2.3x in runtime for using a smarter compiler

Library Effects
Pitfall: Relying on standard library for programmer efficiency.
Surprises in profile for early version:
  Lib calls (string) and object constructors???
  When you are dealing with 200MB traces you want to be I/O bound.
Workaround: Don't use objects
  Make everything extern "C" {...}
  Use C strcpy instead of C++ string::assign
Result: Time in I/O reduced from 95% to 70%
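The workaround above (plain structs, `extern "C"`, `strcpy` in place of `string::assign`) could be sketched as follows. The `TraceRec` layout, the `parse_line` helper, and the `<hex pc> <opcode>` line format are assumptions made for illustration; the slides do not describe the real trace format.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

extern "C" {

// Hypothetical fixed-size trace record: no constructors, no heap allocation.
struct TraceRec {
    char opcode[16];
    unsigned long pc;
};

// Parse one line of the assumed form "<hex pc> <opcode>".
// Returns 1 on success, 0 if the line does not match.
int parse_line(const char* line, struct TraceRec* out) {
    assert(line != nullptr && out != nullptr);
    char op[16];
    unsigned long pc;
    // %15s bounds the opcode so it always fits in op, including the terminator.
    if (std::sscanf(line, "%lx %15s", &pc, op) != 2) return 0;
    std::strcpy(out->opcode, op);  // safe: op holds at most 15 characters plus NUL
    out->pc = pc;
    return 1;
}

}  // extern "C"
```

Because the record is a fixed-size aggregate, parsing a line touches no allocator and runs no constructors, which is the kind of overhead the early profile flagged.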

Ideal Simulator Construction Experience
User selectable multiple levels of detail
Having a detailed understanding of processor first
  Access to documentation, notes
  Information about design decisions
A better mix of C and C++
Well defined input format (parsing traces is Evil)
Component framework for simulator construction
  Standardize interface between pieces: RTL, coarse-grained cycle, queueing theory, memory interface, hand hacked...
[Diagram labels: custom RTL, Queue model, Queue model]
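One way to picture the standardized interface wished for above is a single abstract base class that models at different levels of detail all implement. This is a speculative sketch, not an existing framework: `Component`, `QueueModel`, `CycleModel`, and `run_until_idle` are hypothetical names, and the two models are deliberately simplistic.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical common interface that every model of a unit would implement.
class Component {
public:
    virtual ~Component() = default;
    virtual void step() = 0;        // advance one simulated cycle
    virtual bool idle() const = 0;  // true when no work remains
};

// Coarse queue-style model: a backlog of items drained at a fixed rate per cycle.
class QueueModel : public Component {
public:
    QueueModel(long items, long items_per_cycle)
        : items_(items), rate_(items_per_cycle) {
        assert(items_per_cycle > 0);  // a zero rate would never drain
    }
    void step() override { items_ -= std::min(items_, rate_); }
    bool idle() const override { return items_ == 0; }

private:
    long items_;
    long rate_;
};

// More detailed stand-in: tracks remaining busy cycles one at a time.
class CycleModel : public Component {
public:
    explicit CycleModel(long busy_cycles) : remaining_(busy_cycles) {}
    void step() override { if (remaining_ > 0) --remaining_; }
    bool idle() const override { return remaining_ == 0; }

private:
    long remaining_;
};

// Drive a mix of models through the same interface until all are idle.
long run_until_idle(std::vector<std::unique_ptr<Component>>& parts) {
    long cycles = 0;
    for (;;) {
        bool all_idle = true;
        for (const auto& p : parts) {
            if (!p->idle()) all_idle = false;
        }
        if (all_idle) return cycles;
        for (auto& p : parts) p->step();
        ++cycles;
    }
}
```

With an interface like this, swapping a queue model for a more detailed one would only change which object is constructed, not the driver loop.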
