A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls
Olaf Lubeck, Ram Srinivasan, Jeanine Cook
Single Processor Efficiency is Critical in Parallel Systems
• Efficiency loss in scaling from 1 to 1000's of PEs? Roughly 2-3x (Deterministic Transport: Kerbyson, Hoisie, Pautz, 2003)
• Percent of peak on a single PE? About 5-8% (12-20x less than peak)
Processor Model: A Monte Carlo Approach to Predict CPI
• Token generator: emits tokens representing instruction classes at a maximum rate of one token every CPIi cycles; non-producing tokens retire immediately
• Service centers: model the delays caused by ALU latencies, memory latencies, and branch misprediction
• Feedback loop: stalls arise from latencies interacting with application characteristics (a stalled producer holds up its dependent tokens)
• Inherent CPI (CPIi): the best application CPI given no processor stalls, i.e., infinite, zero-latency resources
Processor Model: A Monte Carlo Approach to Predict CPI (cont.)
• The token generator emits producer tokens at a maximum rate of one every CPIi cycles; non-producing tokens retire immediately
• Each token takes a path through the service centers; transition probabilities are associated with each path
• Service-center latencies (cycles): INT: 1, FPU: 4, GSF: 6, BrM: 6, L1: 1, L2: 6, L3: 16, TLB: 31, MEM: variable
• Before a token retires, a dependence check and stall generator applies dependence distances drawn from the application's distributions
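The token-generation step can be sketched as follows. This is a minimal illustration, not the authors' ~800-line C model: the service-center latencies are the cycle counts from the slide, but the transition probabilities are made-up placeholders standing in for the values measured by binary instrumentation.

```python
import random

# Service-center latencies (cycles) from the slide; MEM is variable in
# the real model but fixed at an illustrative 260 cycles here.
LATENCY = {"INT": 1, "FPU": 4, "GSF": 6, "BrM": 6,
           "L1": 1, "L2": 6, "L3": 16, "TLB": 31, "MEM": 260}

# Placeholder transition probabilities (illustrative only; the real
# values come from binary instrumentation of the application).
PROB = {"INT": 0.35, "FPU": 0.15, "L1": 0.30, "L2": 0.10,
        "L3": 0.05, "MEM": 0.02, "TLB": 0.01, "GSF": 0.01, "BrM": 0.01}

def generate_token(rng):
    """Draw one instruction-class token according to the path probabilities."""
    r, acc = rng.random(), 0.0
    for cls, p in PROB.items():
        acc += p
        if r < acc:
            return cls
    return "INT"  # guard against floating-point round-off
```

A full simulation would issue one such token every CPIi cycles and route it through the dependence check described on the following slide.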
Processor Stalls and Characterization of Application Dependence
• The major sources of stalls for in-order processors are RAW and WAW dependences
• Application pdfs (dependence distance, in instructions): load-to-use, FP-to-use, INT-to-use
• Example: based on the path token 2 has taken (a 16-cycle L3 hit) and a load-to-use distance of 4 instructions, the stall time (and cause) charged to token 6 is 16 − 4·CPIi
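The slide's 16 − 4·CPIi example implies a simple stall formula: the consumer waits for whatever producer latency remains after the intervening instructions have issued at the inherent rate, floored at zero. A sketch (the function name and zero floor are my framing of the slide's arithmetic):

```python
def stall_cycles(producer_latency, dep_distance, cpi_inherent):
    """RAW/WAW stall charged to the consumer: the producer latency left
    after dep_distance instructions issue at the inherent rate (CPIi)."""
    return max(0.0, producer_latency - dep_distance * cpi_inherent)

# Slide example: token 2 took a 16-cycle path (L3 hit) and token 6
# consumes its result 4 instructions later.
print(stall_cycles(16, 4, 1.0))  # 12.0 when CPIi = 1
print(stall_cycles(1, 4, 1.0))   # 0.0 -- a 1-cycle INT producer never stalls here
```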
Summary of Model Parameters
• Inherent CPI: measured with a binary instrumentation tool
• Instruction classes: INT, FP, Reg, branch misprediction, loads, non-producing (stores are retired immediately and treated as non-producers)
• Transition probabilities: the probability of generating each instruction class, computed from binary instrumentation; cache hit rates come from performance counters and can also be predicted from models
• Distribution functions of dependence distances (measured in instructions): load-to-use, FP-to-use, INT-to-use, from binary instrumentation
• Processor and memory latencies: from architecture manuals
• Practical costs: parameters are computed in 1-2 hours; the model converges in a few seconds; ~800 lines of C code
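The parameter set listed on this slide can be grouped into a single container; the field names and example values below are illustrative placeholders, not the model's actual C structures.

```python
from dataclasses import dataclass

@dataclass
class ModelParams:
    """Inputs to the Monte Carlo model (field names are illustrative)."""
    cpi_inherent: float      # from binary instrumentation
    class_probs: dict        # instruction-class transition probabilities
    cache_hit_rates: dict    # from performance counters (or cache models)
    dep_distance_pdfs: dict  # load-to-use, FP-to-use, INT-to-use
    latencies: dict          # from architecture manuals

params = ModelParams(
    cpi_inherent=1.0,                       # placeholder value
    class_probs={"INT": 0.35, "FP": 0.15},  # truncated for illustration
    cache_hit_rates={"L1": 0.90, "L2": 0.057},
    dep_distance_pdfs={"load-to-use": {1: 0.5, 4: 0.3, 8: 0.2}},
    latencies={"L3": 16, "MEM": 260},
)
```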
Model Accuracy (Constant Memory Latencies)
• Bench: 1.3 GHz Itanium 2, 3 MB L3 cache, 260 cp memory latency
• Hertz: 900 MHz Itanium 2, 1.5 MB L3 cache, 112 cp memory latency
Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching
• There is a linear relationship between prefetch distance and memory latency
• A late prefetch can increase memory latency
• This relationship suggests modeling a prefetch-to-load pdf
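One simple reading of the first point can be sketched as below, under the assumption that a load waits only for the latency a prefetch has not yet hidden; the late-prefetch penalty of the second point (a late prefetch actually increasing latency) is deliberately not modeled here.

```python
def effective_latency(mem_latency, prefetch_to_load_cycles):
    """Latency seen by a load whose cache line was prefetched
    prefetch_to_load_cycles earlier: a timely prefetch hides everything,
    a late one hides only part of the memory latency."""
    return max(0.0, mem_latency - prefetch_to_load_cycles)

print(effective_latency(260, 300))  # 0.0   -- prefetch fully timely
print(effective_latency(260, 100))  # 160.0 -- late prefetch, partial hiding
```

Sampling `prefetch_to_load_cycles` from the prefetch-to-load pdf mentioned on the slide would turn this into a per-token variable latency.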
Model Extensions: Toward Multicore Chips (CMPs)
• Memory latency is modeled as a function of the number of outstanding loads
• Hertz: slope of 27 cp per outstanding load; Bench: slope of 101 cp per outstanding load
• Slopes are obtained empirically and are a function of memory controllers, chip speeds, bus bandwidths, etc.
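The linear latency model can be sketched as follows, using the slide's empirical slopes. Taking the two machines' constant latencies from the accuracy slide (260 and 112 cycles) as the intercepts is my assumption, not something the slides state.

```python
def mem_latency(base_cycles, slope_cycles_per_load, outstanding_loads):
    """Empirical linear model: memory latency grows with the number of
    loads already outstanding when this load issues."""
    return base_cycles + slope_cycles_per_load * outstanding_loads

# Slide slopes: Hertz 27 cp/load, Bench 101 cp/load.
print(mem_latency(112, 27, 4))   # Hertz with 4 loads in flight: 220
print(mem_latency(260, 101, 4))  # Bench with 4 loads in flight: 664
```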
Application Characteristics: Dependence Distributions
• Dependence distances are needed to explain stalls
• Sixtrack L2 hits: 5.7%; Eon L2 hits: 3.6%
• Yet Eon's L2 stalls are 6x larger than Sixtrack's: hit rates alone cannot explain the difference, but the dependence-distance distributions can
What kinds of questions can we explore with the model?
• What if the FPU were not pipelined?
• What if the L2 were removed (a two-level cache)?
• What if the processor frequency were changed (power-aware operation)?
Summary
• Monte Carlo techniques can be effectively applied to micro-architecture performance prediction
• Main advantage: whole-program analysis that is predictive and extensible
• Main problems we have seen: small loops are not well predicted, and binary instrumentation for prefetch can take >24 hours
• The model is surprisingly accurate given the architectural and application simplifications
• The distributions used to develop predictive models are significant application characteristics that need to be evaluated
• We are ready to go into "production" mode, applying the model to a number of in-order architectures: Cell and Niagara