PEEP: Exploiting Predictability of Memory Dependences in SMT Processors Samantika Subramaniam, Milos Prvulovic, Gabriel H. Loh
Simplified view of SMT execution
[Diagram: Front-end (Icache) feeding Reservation Stations and Execution Units]
• Store per-thread state
• Enough work from all threads put together: high throughput
Something bad happens…
[Diagram: a producer insn stalls in the Reservation Stations]
• The low-ILP thread eventually uses up the CPU resources
• Other independent, high-ILP threads are forced to stall: defeats the purpose of SMT
• Tackle the problem at the source: the FETCH UNIT
Previously proposed solution
• ICOUNT (Instruction Count) [Tullsen et al. ISCA 1996]: count the number of instructions in the pipeline per thread
• Fetch policy: lower priority to the thread with more in-flight instructions
[Diagram: resources still get clogged. OOPS!]
REACTIVE EXCLUSION!
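The ICOUNT heuristic above is simple enough to sketch in a few lines. This minimal Python sketch (the function name is illustrative, not from the paper) picks the thread with the fewest in-flight instructions each cycle:

```python
# Minimal sketch of the ICOUNT fetch policy: each cycle, fetch from
# the thread with the fewest instructions already in the pipeline.

def icount_pick(inflight_counts):
    """inflight_counts: per-thread number of in-flight instructions.
    Returns the thread id with the lowest count (ties -> lowest id)."""
    return min(range(len(inflight_counts)), key=lambda t: inflight_counts[t])

print(icount_pick([12, 3, 7, 9]))  # 1: thread 1 has the fewest in flight
```

Because ICOUNT only deprioritizes a thread after its instructions have already accumulated, it is reactive: the clog has to appear before the policy responds.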
So can we do better?
[Diagram: an "Oracle" in the front-end keeps the stalling thread out of the pipeline]
PROACTIVE EXCLUSION!
Proactive Exclusion Strategies (PE)
• Load Misses [Moursy et al. ISCA 2003]: GATE a thread on a predicted load miss
• MLP [Eyerman et al. HPCA 2007]: GATE a thread once all available MLP is exposed
• Memory Dependences (this work)
A Brief Overview of Memory Dependences
LSQ:
  ADDR    INST
  0xF023  ST 1
  0xF380  LD 1
  ?       ST 2   (address not yet resolved)
  0xF060  LD 2
Memory Dependence Predictor:
  PC      PRED
  PC 1    0xF793
• Predictability of Memory Dependences: the predictor can indicate future stalls
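The predictor's behavior can be sketched as a small PC-indexed table. This Python sketch is a simplified model in the spirit of a load-wait table (the class and field names are illustrative assumptions, not the paper's hardware design):

```python
# Minimal sketch of a load-wait-style memory dependence predictor:
# a per-load-PC bit recording whether that load previously conflicted
# with an in-flight store. Sizes and indexing are illustrative.

class MemDepPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.wait = [False] * entries  # per-load-PC "will conflict" bit

    def index(self, pc):
        return pc % self.entries       # simple direct-mapped indexing

    def predict(self, load_pc):
        """True => this load is expected to stall on an earlier store."""
        return self.wait[self.index(load_pc)]

    def train(self, load_pc, conflicted):
        # Set the bit on an observed store-load conflict; a real table
        # would also periodically clear bits to avoid over-gating.
        if conflicted:
            self.wait[self.index(load_pc)] = True

mdp = MemDepPredictor()
mdp.train(0xF060, conflicted=True)   # LD 2 was seen to depend on ST 2
print(mdp.predict(0xF060))  # True: future instances predicted to stall
print(mdp.predict(0xF380))  # False: LD 1 never conflicted
```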
Proactive Exclusion using Memory Dependences
[Diagram: threads T0-T3 with ST/LD streams; the predictor learns ST-LD relationships (ST A, LD A) and gates a thread whose load is predicted to depend on an unresolved store (ST ? … LD A)]
Starvation: Problem with Proactive Exclusion
[Diagram: a gated thread's stall resolves, but its instructions only enter the Reservation Stations much later]
• Exclusion (under any strategy) can cause temporary STARVATION
• Especially bad for short-duration stalls!!!
Short Duration Stall
[Timelines: Original vs. Original + PE for the sequence ST A, LD A, ADD, SUB; with PE, gating on the predicted dependence (ST ?, LD A) delays the trailing instructions even though the store address resolves quickly]
Predictability of Memory Disambiguation Latency
LSQ:
  ADDR    INST
  0xF023  ST 1
  0xF380  LD 1
  ?       ST 2
  0xF060  LD 2   (waits 20 cycles behind ST 2)
Memory Dependence Predictor:
  PC      PRED   DELAY
  PC 1    ?      20
• The predictor can indicate the duration of future stalls
• Can we avoid starvation? With PE based on memory dependences, we can
Delay Predictor Details
Memory Dependence Predictor augmented with a per-PC DELAY field (e.g. PC 1 → DELAY 20)
• Conservative: maximum observed delay
• Aggressive: last observed delay
• Adaptive: average of the last 'n' observed delays
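The three policies can be sketched as a small per-PC history table. In this Python model the class name and windowing are illustrative assumptions; in particular, "conservative" here takes the maximum over a short stored window, whereas a hardware table could track a running maximum:

```python
# Sketch of the three delay-prediction policies: conservative
# (max observed), aggressive (last observed), adaptive (average
# of the last n observations). Names are illustrative.
from collections import defaultdict, deque

class DelayPredictor:
    def __init__(self, policy="adaptive", n=4):
        self.policy = policy
        self.history = defaultdict(lambda: deque(maxlen=n))

    def observe(self, pc, delay):
        self.history[pc].append(delay)  # record an observed stall length

    def predict(self, pc):
        h = self.history[pc]
        if not h:
            return 0
        if self.policy == "conservative":
            return max(h)              # maximum observed delay
        if self.policy == "aggressive":
            return h[-1]               # last observed delay
        return sum(h) / len(h)         # adaptive: average of last n

dp = DelayPredictor("conservative")
for d in (12, 20, 16):
    dp.observe(0xF023, d)
print(dp.predict(0xF023))  # 20
```

The trade-off: conservative predictions over-gate (longer exclusion than needed), aggressive ones track the most recent behavior, and adaptive smooths out noise.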
How does this help us?
[Timelines: Original vs. Original + PE for ST A, LD A, ADD, SUB as the store address resolves; with delay information, short stalls can be ignored]
• Choose an appropriate delay threshold
Performance Impact of Delay Information
Phase 1: After 20 cycles…
[Diagram: instruction stream A:ST 1, B:LD 1, …, A:ST 21, B:LD 21; LD xF060 stalls behind ST ? / ST xF060 in the Reservation Stations; the MDP entry for B is updated from P = 0, D = 0 to P = 1, D = 20]
P: prediction, D: delay
Phase 2: Delay Threshold = Front End Depth = 5
[Diagram: on the next instance (A:ST 21, B:LD 21), the MDP entry for B (P = 1, D = 20) predicts a stall longer than the threshold, so B's thread is gated in the front-end]
P: prediction, D: delay
Phase 3, PE without delay information: Front End Depth = 5
[Timeline: fetch restarts only when the stall resolves at cycle 20; instructions enter the RS at cycle 25]
• Instructions enter the RS only after the stall resolves
Phase 3, PE with delay information: Delay Threshold = Front End Depth = 5
[Timeline: fetch restarts at cycle 15; instructions enter the RS at cycle 20]
• Instructions enter the RS right in time, as the stall resolves
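The timing in the two Phase 3 cases reduces to simple arithmetic. This sketch just plugs in the example's numbers (front-end depth 5, predicted 20-cycle stall):

```python
# Worked timing for the example above: when fetch restarts determines
# when instructions reach the Reservation Stations (RS).
FE_DEPTH = 5          # front-end pipeline depth
PREDICTED_DELAY = 20  # predicted stall duration, in cycles

# Without delay information: fetch restarts only when the stall
# resolves, so instructions reach the RS FE_DEPTH cycles too late.
naive_rs_arrival = PREDICTED_DELAY + FE_DEPTH   # 25 cycles

# With delay information (early parole): restart fetch FE_DEPTH
# cycles before the predicted resolution, so instructions arrive
# just as the stall resolves.
restart_time = PREDICTED_DELAY - FE_DEPTH       # 15 cycles
peep_rs_arrival = restart_time + FE_DEPTH       # 20 cycles

print(naive_rs_arrival, restart_time, peep_rs_arrival)  # 25 15 20
```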
PEEP: What does this give us?
• Proactive Exclusion: gate a thread when a memory dependence stall is predicted
• Avoid starvation: ignore short stalls
• Give the thread a head start: restart fetch of a gated thread a few cycles before the stall resolves. Early Parole!!!
PROACTIVE EXCLUSION AND EARLY PAROLE
PEEP In Our Context
[Diagram: the Memory Dependence and Delay Predictor sits beside the front-end (Icache); on a predicted 20-cycle stall the thread is gated, and fetch restarts after (predicted delay - FE pipeline depth) = 15 cycles]
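Putting exclusion and parole together, a PEEP-style fetch decision might look like the following sketch (all names and the dict-based thread state are illustrative assumptions, not the paper's implementation):

```python
# Sketch of PEEP-style fetch gating: a thread is proactively excluded
# when its next load is predicted to stall longer than the delay
# threshold, and paroled FE_DEPTH cycles before the stall is
# predicted to end.

FE_DEPTH = 5   # front-end pipeline depth, also used as delay threshold

def gate(thread, now, predicted_delay):
    """Proactive exclusion: gate only on stalls worth gating for."""
    # Ignore short stalls: gating them would only cause starvation.
    if predicted_delay > FE_DEPTH:
        thread["stall_until"] = now + predicted_delay

def fetch_decision(thread, now):
    """Return True if the thread may compete for fetch slots now."""
    if thread.get("stall_until") is None:
        return True                          # no predicted stall
    parole_time = thread["stall_until"] - FE_DEPTH
    return now >= parole_time                # early parole

t = {}
gate(t, now=0, predicted_delay=20)
print(fetch_decision(t, now=10))  # False: still gated
print(fetch_decision(t, now=15))  # True: paroled 5 cycles early
```

A thread paroled at cycle 15 refills the five-stage front-end so its instructions reach the reservation stations at cycle 20, exactly when the predicted stall resolves.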
Simulation Parameters
• Aggressive four-way SMT processor
• MDP modeled on the Load Wait Table
• SPEC2000, MediaBench and others
• 32 four-thread application mixes evaluated
• Application classification:
  S: sensitive to memory dependences
  N: non-sensitive to memory dependences
  L: low-ILP, M: medium-ILP, H: high-ILP
Proactive Exclusion Strategies: 13% speedup
S: sensitive, N: non-sensitive, L: low-ILP, M: medium-ILP, H: high-ILP
• PE using memory dependences shows a 13% speedup
• Maximum benefit with a mix of both sensitive (S) and non-sensitive (N) threads
• With all-sensitive threads, all PE strategies perform comparably
PEEP: 17% speedup
• PEEP using delay prediction outperforms MLP and PE-mdep
• With all-sensitive threads, PEEP does better since it can predict stall durations accurately
• PEEP with an oracle-based MDP shows a performance speedup of 19%
2-threaded Workloads: 12% speedup
• Fewer threads mean fewer opportunities to fetch from non-stalled threads
• A 12% performance speedup over 25 application mixes shows there is potential benefit even in a 2-way SMT
• An Intel simulator shows an 8% performance speedup over 150 application mixes
Relationship with OOO Load Scheduling
• Hypothesis: the performance benefit is purely due to a more efficient fetch policy based on a highly predictable attribute
• Experiment: PEEP on a processor without OOO memory scheduling; the prediction is used only for controlling the fetch policy
• Result: average speedup over ICOUNT = 17% (same as PEEP!)
• Conclusion: memory dependences are a very good indicator of future stalls; even a machine without load reordering benefits from predicting these stalls
Why does it work so well?
[Diagram: LMP vs. PEEP reservation-station occupancy for a stream LD 1, ST 1, LD 2-LD 4; PEEP keeps the dependent instructions out of the Reservation Stations]
[Diagram: LMP vs. PEEP vs. MLP reservation-station occupancy for LD 1, ST 1, LD 2, ADD, SUB; by gating on the predicted dependence, PEEP leaves RS entries free for independent instructions]
Can expose more ILP
Key Points • Need a mechanism for efficient resource management in SMT • Improve the fetch unit • Memory Dependences and Associated Latencies are predictable • Proactively Exclude “bad” threads but give them Early Parole to avoid temporary starvation • Performance improvements on both 4-way and 2-way SMT machines
Thank You
www.cc.gatech.edu/~samantik
[Cartoon: a queue of LD instructions asking "When will I get paroled?"]
[Backup slides: sensitivity to Predictor Size and to Delay Threshold (charts not reproduced)]
B2: PEEP* (17.3%)
• Memory dependences are a very good indicator of future stalls
• Performance shows that PEEP works because it leverages knowledge of future stalls to improve instruction fetch
B3: Fairness (19%)
• Speedup is computed using the harmonic mean of weighted IPCs
• Since all PE strategies run on top of ICOUNT, they inherit its fairness
• SDS (standard deviation of speedup) is ~0.17 for PEEP and ~0.11 for ICOUNT
B7: Delay Predictors
• Conservative: maximum observed delay
• Aggressive: last observed delay
• Adaptive: average of the last 'n' observed delays