PEEP: Exploiting Predictability of Memory Dependences in SMT Processors Samantika Subramaniam, Milos Prvulovic, Gabriel H. Loh
Simplified view of SMT execution
[Diagram: Front-end (Icache) feeding Reservation Stations and Execution Units]
• Store per-thread state
• Enough work from all threads put together: high throughput
Something bad happens…
[Diagram: a producer insn stalls in the Reservation Stations]
• The low-ILP thread eventually uses up the CPU resources
• Other independent, high-ILP threads are forced to stall: defeats the purpose of SMT
• Tackle the problem at the source: the FETCH UNIT
Previously proposed solution
• ICOUNT (Instruction Count) [Tullsen et al. ISCA 1996]: count the number of instructions in the pipeline per thread
• Fetch policy: lower priority to the thread with more in-flight instructions
[Diagram: resources still get clogged. OOPS!]
REACTIVE EXCLUSION!
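The ICOUNT heuristic above is simple enough to sketch in a few lines. This minimal Python sketch (the function name is illustrative, not from the paper) picks the thread with the fewest in-flight instructions each cycle:

```python
# Minimal sketch of the ICOUNT fetch policy: each cycle, fetch from
# the thread with the fewest instructions already in the pipeline.

def icount_pick(inflight_counts):
    """inflight_counts: per-thread number of in-flight instructions.
    Returns the thread id with the lowest count (ties -> lowest id)."""
    return min(range(len(inflight_counts)), key=lambda t: inflight_counts[t])

print(icount_pick([12, 3, 7, 9]))  # 1: thread 1 has the fewest in flight
```

Because ICOUNT only deprioritizes a thread after its instructions have already accumulated, it is reactive: the clog has to appear before the policy responds.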
So can we do better?
[Diagram: an "Oracle" in the front-end keeps the stalling thread out of the pipeline]
PROACTIVE EXCLUSION!
Proactive Exclusion Strategies (PE)
• Load Misses [Moursy et al. ISCA 2003]: GATE a thread on a predicted load miss
• MLP [Eyerman et al. HPCA 2007]: GATE a thread once all available MLP is exposed
• Memory Dependences (this work)
A Brief Overview of Memory Dependences
LSQ:
  ADDR    INST
  0xF023  ST 1
  0xF380  LD 1
  ?       ST 2   (address not yet resolved)
  0xF060  LD 2
Memory Dependence Predictor:
  PC      PRED
  PC 1    0xF793
• Predictability of Memory Dependences: the predictor can indicate future stalls
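The predictor's behavior can be sketched as a small PC-indexed table. This Python sketch is a simplified model in the spirit of a load-wait table (the class and field names are illustrative assumptions, not the paper's hardware design):

```python
# Minimal sketch of a load-wait-style memory dependence predictor:
# a per-load-PC bit recording whether that load previously conflicted
# with an in-flight store. Sizes and indexing are illustrative.

class MemDepPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.wait = [False] * entries  # per-load-PC "will conflict" bit

    def index(self, pc):
        return pc % self.entries       # simple direct-mapped indexing

    def predict(self, load_pc):
        """True => this load is expected to stall on an earlier store."""
        return self.wait[self.index(load_pc)]

    def train(self, load_pc, conflicted):
        # Set the bit on an observed store-load conflict; a real table
        # would also periodically clear bits to avoid over-gating.
        if conflicted:
            self.wait[self.index(load_pc)] = True

mdp = MemDepPredictor()
mdp.train(0xF060, conflicted=True)   # LD 2 was seen to depend on ST 2
print(mdp.predict(0xF060))  # True: future instances predicted to stall
print(mdp.predict(0xF380))  # False: LD 1 never conflicted
```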
Proactive Exclusion using Memory Dependences
[Diagram: threads T0-T3 with ST/LD streams; the predictor learns ST-LD relationships (ST A, LD A) and gates a thread whose load is predicted to depend on an unresolved store (ST ? … LD A)]
Starvation: Problem with Proactive Exclusion
[Diagram: a gated thread's stall resolves, but its instructions only enter the Reservation Stations much later]
• Exclusion (under any strategy) can cause temporary STARVATION
• Especially bad for short-duration stalls!!!
Short Duration Stall
[Timelines: Original vs. Original + PE for the sequence ST A, LD A, ADD, SUB; with PE, gating on the predicted dependence (ST ?, LD A) delays the trailing instructions even though the store address resolves quickly]
Predictability of Memory Disambiguation Latency
LSQ:
  ADDR    INST
  0xF023  ST 1
  0xF380  LD 1
  ?       ST 2
  0xF060  LD 2   (waits 20 cycles behind ST 2)
Memory Dependence Predictor:
  PC      PRED   DELAY
  PC 1    ?      20
• The predictor can indicate the duration of future stalls
• Can we avoid starvation? With PE based on memory dependences, we can
Delay Predictor Details
Memory Dependence Predictor augmented with a per-PC DELAY field (e.g. PC 1 → DELAY 20)
• Conservative: maximum observed delay
• Aggressive: last observed delay
• Adaptive: average of the last 'n' observed delays
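The three policies can be sketched as a small per-PC history table. In this Python model the class name and windowing are illustrative assumptions; in particular, "conservative" here takes the maximum over a short stored window, whereas a hardware table could track a running maximum:

```python
# Sketch of the three delay-prediction policies: conservative
# (max observed), aggressive (last observed), adaptive (average
# of the last n observations). Names are illustrative.
from collections import defaultdict, deque

class DelayPredictor:
    def __init__(self, policy="adaptive", n=4):
        self.policy = policy
        self.history = defaultdict(lambda: deque(maxlen=n))

    def observe(self, pc, delay):
        self.history[pc].append(delay)  # record an observed stall length

    def predict(self, pc):
        h = self.history[pc]
        if not h:
            return 0
        if self.policy == "conservative":
            return max(h)              # maximum observed delay
        if self.policy == "aggressive":
            return h[-1]               # last observed delay
        return sum(h) / len(h)         # adaptive: average of last n

dp = DelayPredictor("conservative")
for d in (12, 20, 16):
    dp.observe(0xF023, d)
print(dp.predict(0xF023))  # 20
```

The trade-off: conservative predictions over-gate (longer exclusion than needed), aggressive ones track the most recent behavior, and adaptive smooths out noise.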
How does this help us?
[Timelines: Original vs. Original + PE for ST A, LD A, ADD, SUB as the store address resolves; with delay information, short stalls can be ignored]
• Choose an appropriate delay threshold
Performance Impact of Delay Information
Phase 1: After 20 cycles…
[Diagram: instruction stream A:ST 1, B:LD 1, …, A:ST 21, B:LD 21; LD xF060 stalls behind ST ? / ST xF060 in the Reservation Stations; the MDP entry for B is updated from P = 0, D = 0 to P = 1, D = 20]
P: prediction, D: delay
Phase 2: Delay Threshold = Front End Depth = 5
[Diagram: on the next instance (A:ST 21, B:LD 21), the MDP entry for B (P = 1, D = 20) predicts a stall longer than the threshold, so B's thread is gated in the front-end]
P: prediction, D: delay
Phase 3, PE without delay information: Front End Depth = 5
[Timeline: fetch restarts only when the stall resolves at cycle 20; instructions enter the RS at cycle 25]
• Instructions enter the RS only after the stall resolves
Phase 3, PE with delay information: Delay Threshold = Front End Depth = 5
[Timeline: fetch restarts at cycle 15; instructions enter the RS at cycle 20]
• Instructions enter the RS right in time, as the stall resolves
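The timing in the two Phase 3 cases reduces to simple arithmetic. This sketch just plugs in the example's numbers (front-end depth 5, predicted 20-cycle stall):

```python
# Worked timing for the example above: when fetch restarts determines
# when instructions reach the Reservation Stations (RS).
FE_DEPTH = 5          # front-end pipeline depth
PREDICTED_DELAY = 20  # predicted stall duration, in cycles

# Without delay information: fetch restarts only when the stall
# resolves, so instructions reach the RS FE_DEPTH cycles too late.
naive_rs_arrival = PREDICTED_DELAY + FE_DEPTH   # 25 cycles

# With delay information (early parole): restart fetch FE_DEPTH
# cycles before the predicted resolution, so instructions arrive
# just as the stall resolves.
restart_time = PREDICTED_DELAY - FE_DEPTH       # 15 cycles
peep_rs_arrival = restart_time + FE_DEPTH       # 20 cycles

print(naive_rs_arrival, restart_time, peep_rs_arrival)  # 25 15 20
```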
PEEP: What does this give us?
• Proactive Exclusion: gate a thread when a memory dependence stall is predicted
• Avoid starvation: ignore short stalls
• Give the thread a head start: restart fetch of a gated thread a few cycles before the stall resolves. Early Parole!!!
PROACTIVE EXCLUSION AND EARLY PAROLE
PEEP In Our Context
[Diagram: the Memory Dependence and Delay Predictor sits beside the front-end (Icache); on a predicted 20-cycle stall the thread is gated, and fetch restarts after (predicted delay - FE pipeline depth) = 15 cycles]
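Putting exclusion and parole together, a PEEP-style fetch decision might look like the following sketch (all names and the dict-based thread state are illustrative assumptions, not the paper's implementation):

```python
# Sketch of PEEP-style fetch gating: a thread is proactively excluded
# when its next load is predicted to stall longer than the delay
# threshold, and paroled FE_DEPTH cycles before the stall is
# predicted to end.

FE_DEPTH = 5   # front-end pipeline depth, also used as delay threshold

def gate(thread, now, predicted_delay):
    """Proactive exclusion: gate only on stalls worth gating for."""
    # Ignore short stalls: gating them would only cause starvation.
    if predicted_delay > FE_DEPTH:
        thread["stall_until"] = now + predicted_delay

def fetch_decision(thread, now):
    """Return True if the thread may compete for fetch slots now."""
    if thread.get("stall_until") is None:
        return True                          # no predicted stall
    parole_time = thread["stall_until"] - FE_DEPTH
    return now >= parole_time                # early parole

t = {}
gate(t, now=0, predicted_delay=20)
print(fetch_decision(t, now=10))  # False: still gated
print(fetch_decision(t, now=15))  # True: paroled 5 cycles early
```

A thread paroled at cycle 15 refills the five-stage front-end so its instructions reach the reservation stations at cycle 20, exactly when the predicted stall resolves.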
Simulation Parameters
• Aggressive four-way SMT processor
• MDP modeled on the Load Wait Table
• SPEC2000, MediaBench and others
• 32 four-thread application mixes evaluated
• Application classification:
  S: sensitive to memory dependences
  N: non-sensitive to memory dependences
  L: low-ILP, M: medium-ILP, H: high-ILP
Proactive Exclusion Strategies: 13% speedup
S: sensitive, N: non-sensitive, L: low-ILP, M: medium-ILP, H: high-ILP
• PE using memory dependences shows a 13% speedup
• Maximum benefit with a mix of both sensitive (S) and non-sensitive (N) threads
• With all-sensitive threads, all PE strategies perform comparably
PEEP: 17% speedup
• PEEP using delay prediction outperforms MLP and PE-mdep
• With all-sensitive threads, PEEP does better since it can predict stall durations accurately
• PEEP with an oracle-based MDP shows a performance speedup of 19%
2-threaded Workloads: 12% speedup
• Fewer threads mean fewer opportunities to fetch from non-stalled threads
• A 12% performance speedup over 25 application mixes shows there is potential benefit even in a 2-way SMT
• An Intel simulator shows an 8% performance speedup over 150 application mixes
Relationship with OOO Load Scheduling
• Hypothesis: the performance benefit is purely due to a more efficient fetch policy based on a highly predictable attribute
• Experiment: PEEP on a processor without OOO memory scheduling; the prediction is used only for controlling the fetch policy
• Result: average speedup over ICOUNT = 17% (same as PEEP!)
• Conclusion: memory dependences are a very good indicator of future stalls; even a machine without load reordering benefits from predicting these stalls
Why does it work so well?
[Diagram: LMP vs. PEEP reservation-station occupancy for a stream LD 1, ST 1, LD 2-LD 4; PEEP keeps the dependent instructions out of the Reservation Stations]
[Diagram: LMP vs. PEEP vs. MLP reservation-station occupancy for LD 1, ST 1, LD 2, ADD, SUB; by gating on the predicted dependence, PEEP leaves RS entries free for independent instructions]
Can expose more ILP
Key Points • Need a mechanism for efficient resource management in SMT • Improve the fetch unit • Memory Dependences and Associated Latencies are predictable • Proactively Exclude “bad” threads but give them Early Parole to avoid temporary starvation • Performance improvements on both 4-way and 2-way SMT machines
Thank You
www.cc.gatech.edu/~samantik
[Cartoon: a queue of LD instructions asking "When will I get paroled?"]
[Backup slides: sensitivity to Predictor Size and to Delay Threshold (charts not reproduced)]
B2: PEEP* (17.3%)
• Memory dependences are a very good indicator of future stalls
• Performance shows that PEEP works because it leverages knowledge of future stalls to improve instruction fetch
B3: Fairness (19%)
• Speedup is computed using the harmonic mean of weighted IPCs
• Since all PE strategies run on top of ICOUNT, they inherit its fairness
• SDS (standard deviation of speedup) is ~0.17 for PEEP and ~0.11 for ICOUNT
B7: Delay Predictors
• Conservative: maximum observed delay
• Aggressive: last observed delay
• Adaptive: average of the last 'n' observed delays