Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CSCE614 Texas A&M University
Overview • What is an architectural simulator • A tool that reproduces the behavior of a computing device • Why use a simulator • Leverage a faster, more flexible software development cycle • Permit more design space exploration • Facilitate validation before H/W becomes available • Level of abstraction is tailored to the design task • Possible to increase/improve system instrumentation • Usually less expensive than building a real system
Advantages of SimpleScalar • Highly flexible • functional simulator + performance simulator • Portable • Host: virtual target runs on most Unix-like systems • Target: simulators can support multiple ISAs • Extensible • Source is included for compiler, libraries, simulators • Easy to write simulators • Performance • Runs codes approaching ‘real’ sizes
Simulation Tools [Figure: taxonomy of architectural simulators. Labels: Functional, Performance, Trace-Driven, Exec-Driven, Interpreters, Direct Execution, Inst Schedulers, Cycle Timers. Shaded tools are included in the SimpleScalar Tool Set.]
Functional vs. Performance Simulators • Functional simulators implement the architecture • Perform real execution • Implement what programmers see • Performance simulators implement the microarchitecture • Model system resources/internals • Concerned with timing • Do not implement what programmers see
Trace Driven vs. Execution Driven Simulators • Trace-Driven • Simulator reads a ‘trace’ of the instructions captured during a previous execution • Easy to implement • No functional components necessary • No feedback to the trace (e.g. mis-prediction) • Execution-Driven • Simulator runs the program (trace-on-the-fly) • Hard to implement • Advantages • Faster than tracing • No need to store traces • Register and memory values are available (they usually are not in a trace) • Supports mis-speculation cost modeling
Instruction Schedulers vs. Cycle Timers • Instruction Schedulers • Simulator schedules each instruction when resources are available • Instructions are processed one at a time • Simpler, but less detailed • Cycle Timers • Simulator tracks microarch. state each cycle • Simulator state == microarchitecture state • Perfect for microarchitecture simulation
SimpleScalar Release 3.0 • SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP. • All simulators now support external I/O traces (EIO traces). Generated with a new simulator (sim-eio) • Support more platforms • explicit fault support • And many more
Simulator Suite (ordered from highest performance to highest detail)
• Sim-Fast: 300 lines, functional, 4+ MIPS
• Sim-Safe: 350 lines, functional w/ checks
• Sim-Profile: 900 lines, functional, lots of stats
• Sim-Cache, Sim-Cheetah, Sim-BPred: < 1000 lines, functional, cache stats, branch stats
• Sim-Outorder: 3900 lines, performance, OoO issue, branch pred., mis-spec., ALUs, cache, TLB, 200+ KIPS
Sim-Fast • Functional simulation • Optimized for speed • Assumes no cache • Assumes no instruction checking • Does not support Dlite! • Does not allow command line arguments • <300 lines of code
Sim-Safe • Functional simulation • Checks for instruction errors • Optimized for speed • Assumes no cache • Supports Dlite! • Does not allow command line arguments
Sim-Cache • Cache simulation • Ideal for fast simulation of caches (if the effect of cache performance on execution time is not necessary) • Accepts command line arguments for: • level 1 & 2 instruction and data caches • TLB configuration (data and instruction) • Flush and compress • and more • Ideal for performing high-level cache studies that don’t take access time of the caches into account
Sim-Cache (cont'd)
• Generates one- and two-level cache hierarchy statistics and profiles
• Extra options (also supported on sim-outorder):
    -cache:dl1 <config>   level 1 data cache configuration
    -cache:dl2 <config>   level 2 data cache configuration
    -cache:il1 <config>   level 1 instruction cache configuration
    -cache:il2 <config>   level 2 instruction cache configuration
    -tlb:dtlb <config>    data TLB configuration
    -tlb:itlb <config>    instruction TLB configuration
    -flush <config>       flush caches on system calls
    -icompress            remaps 64-bit inst addresses to 32-bit equiv.
    -pcstat <stat>        record statistic <stat> by text address
Specifying Cache Configurations
• All cache and TLB configurations are specified with the same format:
    <name>:<nsets>:<bsize>:<assoc>:<repl>
• where:
    <name>   cache name (make this unique)
    <nsets>  number of sets
    <bsize>  block size in bytes (for TLBs, the page size)
    <assoc>  associativity (number of “ways”)
    <repl>   set replacement policy: l for LRU, f for FIFO, r for RANDOM
• Examples:
    il1:1024:32:2:l    2-way set-assoc 64k-byte cache, LRU
    dtlb:1:4096:64:r   64-entry fully assoc TLB w/ 4k pages, random replacement
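The configuration format above can be checked with a little arithmetic: capacity is sets times block size times ways. The following is a minimal sketch in Python, not part of SimpleScalar; `CacheConfig` and `parse_cache_config` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Replacement-policy codes as listed on the slide.
POLICIES = {"l": "LRU", "f": "FIFO", "r": "RANDOM"}


@dataclass(frozen=True)
class CacheConfig:
    name: str
    nsets: int
    bsize: int
    assoc: int
    repl: str

    @property
    def capacity_bytes(self) -> int:
        # Total capacity = number of sets x block size x associativity.
        return self.nsets * self.bsize * self.assoc


def parse_cache_config(text: str) -> CacheConfig:
    """Parse '<name>:<nsets>:<bsize>:<assoc>:<repl>' (hypothetical helper)."""
    name, nsets, bsize, assoc, repl = text.split(":")
    if repl not in POLICIES:
        raise ValueError(f"unknown replacement policy: {repl!r}")
    return CacheConfig(name, int(nsets), int(bsize), int(assoc), repl)


# The two examples from the slide.
il1 = parse_cache_config("il1:1024:32:2:l")
print(il1.capacity_bytes // 1024, "KB,", POLICIES[il1.repl])

dtlb = parse_cache_config("dtlb:1:4096:64:r")
print(dtlb.nsets * dtlb.assoc, "entries,", dtlb.bsize, "byte pages")
```

For the TLB example, the "block size" field carries the page size, so the entry count is simply sets times ways.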
Sim-Bpred
• Simulates different branch prediction mechanisms
• Generates prediction hit and miss rate reports
• Does not simulate the effect of branch prediction on total execution time
• Predictor types:
    nottaken
    taken
    perfect
    bimod   bimodal predictor
    2lev    2-level adaptive predictor
    comb    combined predictor (bimodal and 2-level)
Sim-Profile • Program Profiler • Generates detailed profiles, by symbol and by address • Keeps track of and reports • Dynamic instruction counts • Instruction class counts • Branch class counts • Usage of address modes • Profiles of the text & data segment
Sim-Outorder • Most complicated and detailed simulator • Supports out-of-order issue and execution • Provides reports • branch prediction • cache • external memory • various configurations
Sim-Outorder: Detailed Performance Simulator
• Generates timing statistics for a detailed out-of-order issue processor core with two-level cache memory hierarchy and main memory
• Extra options:
    -fetch:ifqsize <size>    instruction fetch queue size (in insts)
    -fetch:mplat <cycles>    extra branch mis-prediction latency (cycles)
    -bpred <type>            specify the branch predictor
    -decode:width <insts>    decoder bandwidth (insts/cycle)
    -issue:width <insts>     RUU issue bandwidth (insts/cycle)
    -issue:inorder           constrain instruction issue to program order
    -issue:wrongpath         permit instruction issue after mis-speculation
    -ruu:size <insts>        capacity of RUU (insts)
    -lsq:size <insts>        capacity of load/store queue (insts)
    -cache:dl1 <config>      level 1 data cache configuration
    -cache:dl1lat <cycles>   level 1 data cache hit latency
Sim-Outorder: Detailed Performance Simulator (cont'd)
    -cache:dl2 <config>      level 2 data cache configuration
    -cache:dl2lat <cycles>   level 2 data cache hit latency
    -cache:il1 <config>      level 1 instruction cache configuration
    -cache:il1lat <cycles>   level 1 instruction cache hit latency
    -cache:il2 <config>      level 2 instruction cache configuration
    -cache:il2lat <cycles>   level 2 instruction cache hit latency
    -cache:flush             flush all caches on system calls
    -cache:icompress         remap 64-bit inst addresses to 32-bit equiv.
    -mem:lat <1st> <next>    specify memory access latency (first, rest)
    -mem:width               specify width of memory bus (in bytes)
    -tlb:itlb <config>       instruction TLB configuration
    -tlb:dtlb <config>       data TLB configuration
    -tlb:lat <cycles>        latency (in cycles) to service a TLB miss
Sim-Outorder: Detailed Performance Simulator (cont'd)
    -res:ialu                specify number of integer ALUs
    -res:imult               specify number of integer multiplier/dividers
    -res:memports            specify number of first-level cache ports
    -res:fpalu               specify number of FP ALUs
    -res:fpmult              specify number of FP multiplier/dividers
    -pcstat <stat>           record statistic <stat> by text address
    -ptrace <file> <range>   generate pipetrace
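The -mem:lat <1st> <next> and -mem:width options listed above suggest a simple chunked-transfer model: the first bus-width chunk of a block costs <1st> cycles and each remaining chunk costs <next> cycles. The sketch below assumes that reading; `mem_access_latency` is a hypothetical helper and the numbers in the usage line are illustrative values, not documented defaults.

```python
def mem_access_latency(block_bytes: int, bus_width: int,
                       first_lat: int, next_lat: int) -> int:
    """Cycles to move one block over the memory bus (assumed model)."""
    if block_bytes <= 0 or bus_width <= 0:
        raise ValueError("block size and bus width must be positive")
    # Number of bus-width chunks needed, rounded up.
    chunks = (block_bytes + bus_width - 1) // bus_width
    # First chunk pays the full latency; the rest pay the inter-chunk latency.
    return first_lat + next_lat * (chunks - 1)


# A 32-byte block over an 8-byte bus, with latencies of 18 and 2 cycles:
# 4 chunks, so 18 + 2 * 3 cycles.
print(mem_access_latency(32, 8, first_lat=18, next_lat=2))
```

Under this model, widening the bus shortens only the "rest" portion of the transfer; the first-chunk latency is paid on every access.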
Specifying the Branch Predictor
• Specifying the branch predictor type:
    -bpred <type>
• The supported predictor types are:
    nottaken   always predict not taken
    taken      always predict taken
    perfect    perfect predictor
    bimod      bimodal predictor (BTB w/ 2 bit counters)
    2lev       2-level adaptive predictor
• Configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
    -bpred:bimod <size>   size of direct-mapped BTB
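To make the bimodal predictor concrete, here is a minimal sketch of a direct-mapped table of 2-bit saturating counters. It is a toy model written for this tutorial, not the SimpleScalar implementation: the class name, the initial counter value, and the indexing (dropping the low two address bits, which assumes 4-byte instructions) are all assumptions.

```python
class BimodalPredictor:
    """Direct-mapped table of 2-bit saturating counters (toy model)."""

    def __init__(self, size: int) -> None:
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        # Counter values 0..3; start at 1 (weakly not-taken), an arbitrary choice.
        self.counters = [1] * size

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so the low 2 bits carry no information.
        return (pc >> 2) & (self.size - 1)

    def predict(self, pc: int) -> bool:
        # Counter values 2 and 3 predict taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def hit_rate(predictor: BimodalPredictor, trace) -> float:
    """Fraction of branches in (pc, taken) pairs that were predicted correctly."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


# One loop branch: taken 9 times, then not taken once, repeated 10 times.
trace = [(0x400100, i % 10 != 9) for i in range(100)]
print(f"hit rate: {hit_rate(BimodalPredictor(2048), trace):.2f}")
```

The 2-bit counter mispredicts each loop exit but not the following re-entry, which is the usual motivation for using two bits rather than one.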
Specifying the Branch Predictor (cont'd)
• Configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
    -bpred:2lev <l1size> <l2size> <hist_size> <xor>
• Configurations: N, M, W, X
    N: # entries in first level (# of shift registers)
    M: # entries in 2nd level (# of counters, or other FSM)
    W: width of shift register(s) (# of bits in each shift register)
    X: (yes-1/no-0) xor history and address for 2nd level index (we use 0 for this homework)
• Sample predictors:
    GAg: 1, M, W, 0 where M = 2^W
    GAp: 1, M, W, 0 where M = C * 2^W, C is # of per-address prediction tables
    PAg: N, M, W, 0 where M = 2^W
    PAp: N, M, W, 0 where M = N * 2^W
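The sample-predictor formulas above can be turned into a small calculator. This is a sketch for checking homework configurations by hand; `two_level_config` and `as_option` are hypothetical helper names, and X is fixed at 0 as on the slide.

```python
def two_level_config(scheme: str, width: int, n: int = 1, c: int = 1) -> tuple:
    """Return (N, M, W, X) for a sample 2-level predictor, with X = 0.

    width: history width W in bits
    n:     number of first-level shift registers (PAg, PAp)
    c:     number of per-address prediction tables (GAp)
    """
    rows = 2 ** width  # counters addressed by one W-bit history pattern
    if scheme == "GAg":
        return (1, rows, width, 0)
    if scheme == "GAp":
        return (1, c * rows, width, 0)
    if scheme == "PAg":
        return (n, rows, width, 0)
    if scheme == "PAp":
        return (n, n * rows, width, 0)
    raise ValueError(f"unknown scheme: {scheme!r}")


def as_option(cfg: tuple) -> str:
    """Format a configuration as a -bpred:2lev argument string."""
    return "-bpred:2lev " + " ".join(str(v) for v in cfg)


# GAp with a 2-bit global history and 8 per-address tables: M = 8 * 2^2 = 32.
print(as_option(two_level_config("GAp", width=2, c=8)))
```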
Performance Comparison of GAg, GAp, PAg and PAp • GAp: 1 global history register and 8 per-address prediction tables [Figure: (a) GAp; (b) (2,2) predictor. Labels: branch address (4), 2-bit per-branch predictors, 2-bit global branch history, prediction.]
Hack the State Machine of the Branch Predictor! [Figure: two state-machine diagrams. (a) A3 (same as shown in the textbook); (b) A2 (original SimpleScalar implementation).]
Sim-Outorder HW Architecture [Figure: pipeline block diagram. Labels: Fetch, Dispatch, Register Scheduler, Memory Scheduler, Exe, Mem, Writeback, Commit, I-Cache, I-TLB, D-Cache, D-TLB, Virtual Memory.]
Sim-Outorder (Main Loop)
• sim_main() in sim-outorder.c:
    ruu_init();
    for (;;) {
      ruu_commit();
      ruu_writeback();
      lsq_refresh();
      ruu_issue();
      ruu_dispatch();
      ruu_fetch();
    }
• The loop body is executed once for each simulated machine cycle
• Walks the pipeline from Commit to Fetch
• Reverse traversal handles inter-stage latch synchronization in a single pass
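Why the reverse walk matters can be seen in a toy pipeline with one single-entry latch per stage. The sketch below is an illustration only, with hypothetical names and an invented five-stage structure; it is not how sim-outorder.c is written. When stages are visited from Commit back to Fetch, each instruction advances one stage per cycle. When they are visited in forward order, an instruction is handed from stage to stage within the same cycle and its timing collapses.

```python
def simulate(cycles: int, reverse: bool) -> dict:
    """Return {instruction id: cycle in which it retired} for a toy pipeline."""
    n = 5                   # fetch, dispatch, issue, writeback, commit
    slots = [None] * n      # one single-entry latch per stage
    retired = {}
    next_id = 0

    def step(stage: int, cycle: int) -> None:
        nonlocal next_id
        if stage == n - 1:
            # Commit: retire the instruction and free the latch.
            if slots[stage] is not None:
                retired[slots[stage]] = cycle
                slots[stage] = None
        elif slots[stage] is not None and slots[stage + 1] is None:
            # Advance into the next stage if its latch is free.
            slots[stage + 1], slots[stage] = slots[stage], None
        if stage == 0 and slots[0] is None:
            # Fetch: bring a new instruction into the first latch.
            slots[0] = next_id
            next_id += 1

    order = range(n - 1, -1, -1) if reverse else range(n)
    for cycle in range(cycles):
        for stage in order:
            step(stage, cycle)
    return retired


print("reverse walk:", simulate(10, reverse=True))
print("forward walk:", simulate(10, reverse=False))
```

A forward walk could be made correct by double-buffering every latch and copying the buffers at the end of each cycle; visiting the stages in reverse order avoids that second pass.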
Sim-Outorder (RUU/LSQ) • RUU (Register Update Unit) • Handles register synchronization/communication • Serves as reorder buffer and reservation stations • Performs out-of-order issue when register and memory dependences are satisfied • LSQ (Load/Store Queue) • Handles memory synchronization/communication • Contains all loads and stores in program order • Relationship between RUU and LSQ • Memory dependencies are resolved by LSQ • Load/Store effective address calculated in RUU
Sim-Outorder: Fetch • ruu_fetch() • Models machine fetch bandwidth • Fetches instructions from one I-cache/memory block; stalls until I-cache misses are resolved • Instructions are put into the instruction fetch queue, named fetch_data in sim-outorder.c (also called the dispatch queue in the paper) • Probes the branch predictor to obtain the cache line for the next cycle
Sim-Outorder: Dispatch • ruu_dispatch() • Models instruction decoding and register renaming • Takes instructions from fetch_data • Decodes instructions • Enters and links instructions into RUU and LSQ • Splits memory operations into two separate instructions
Sim-Outorder: Scheduler • lsq_refresh() • Models instruction selection, wakeup and issue • Separate schedulers track register and memory dependences. • Locates instructions with all register inputs ready and all memory inputs ready • Issue of ready loads is stalled if there is a store with unresolved effective address in LSQ. • If earlier store address matches load address, target value is forwarded to load.
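The two load-scheduling rules above (stall behind an unresolved store address, forward from an earlier store to the same address) can be sketched as a scan over the older LSQ entries. This is a toy model with hypothetical names (`MemOp`, `schedule_load`); it assumes store data is already available and ignores access sizes and partial overlaps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    kind: str                     # "load" or "store"
    addr: Optional[int]           # None while the effective address is unresolved
    value: Optional[int] = None   # store data (assumed ready in this toy model)


def schedule_load(lsq: list, load_index: int) -> tuple:
    """Decide what the load at lsq[load_index] may do this cycle.

    lsq is in program order, oldest first. Returns ("stall", None),
    ("forward", value) or ("cache", None).
    """
    load = lsq[load_index]
    if load.addr is None:
        return ("stall", None)
    older = lsq[:load_index]
    # Rule 1: any earlier store with an unknown address might alias the load.
    if any(op.kind == "store" and op.addr is None for op in older):
        return ("stall", None)
    # Rule 2: forward from the nearest earlier store to the same address.
    for op in reversed(older):
        if op.kind == "store" and op.addr == load.addr:
            return ("forward", op.value)
    # Otherwise the load goes to the D-cache.
    return ("cache", None)


lsq = [MemOp("store", 0x100, 7), MemOp("store", 0x200, 9), MemOp("load", 0x100)]
print(schedule_load(lsq, 2))
```

Scanning from the youngest older entry backwards ensures that, when several earlier stores hit the same address, the load receives the most recent value.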
Sim-Outorder: Execute • ruu_issue() • Models functional units, D-cache issue and execute latencies • Gets instructions that are ready • Reserves a free functional unit • Schedules writeback events using the latency of the functional unit • Latencies are hardcoded in fu_config[] in sim-outorder.c
Sim-Outorder: Writeback • ruu_writeback() • Models writeback bandwidth, detects mis-predictions, and initiates the mis-prediction recovery sequence • Gets instructions that have finished execution (specified in the event queue) • Wakes up instructions that depend on the completed instruction, following the dependence chains of the instruction's output • Detects branch mis-predictions and rolls state back to the checkpoint
Sim-Outorder: Commit • ruu_commit() • Models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling • While head of RUU/LSQ ready to commit • D-TLB miss handling • Retire store to D-cache • Update register file and rename table • Reclaim RUU/LSQ resources
Sim-Outorder: Processor Core and Other Specifications • Instruction fetch, decode and issue bandwidth • Capacity of RUU and LSQ • Branch mis-prediction latency • Number of functional units • integer ALU, integer multipliers/dividers • FP ALU, FP multipliers/dividers • Latency of I-cache/D-cache, memory and TLB • Record statistic by text address
Global Options
• These are supported on most simulators:
    -h                   print help message
    -d                   enable debug message
    -i                   start up in Dlite! debugger
    -q                   quit immediately (use with -dumpconfig)
    -config <file>       read config parameters from <file>
    -dumpconfig <file>   save config parameters into <file>
How to get help from us • Drop by during TA’s office hour • E-Mail khkim@cse.tamu.edu