Improving Value Communication for Thread-Level Speculation
Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry
School of Computer Science, Carnegie Mellon University
Multithreaded Machines Are Everywhere
• ALPHA 21464, Intel Xeon
• SUN MAJC, IBM Power4, SiByte SB-1250
How can we use them? Parallelism!
Automatic Parallelization
Proving independence of threads is hard:
• complex control flow
• complex data structures
• pointers, pointers, pointers
• run-time inputs
How can we make the compiler's job feasible? Thread-Level Speculation (TLS)
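As a concrete illustration (a minimal sketch, not from the talk), here is the kind of loop a compiler cannot safely parallelize on its own: whether iterations are independent depends on whether the pointers alias, which may only be known at run time. The function name is invented for this example.

    // The compiler cannot parallelize this loop across iterations unless it
    // proves that p never points into q[], which may depend on run-time input.
    void accumulate(int *p, int *q, int n) {
        for (int i = 0; i < n; ++i) {
            *p += q[i];   // store through p; if p points into q[], a later
                          // iteration's q[i] depends on this store
        }
    }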
Thread-Level Speculation
[diagram: epochs E1, E2, E3 run in parallel over time; a store in an earlier epoch that conflicts with a load already performed by a later epoch (a TLS violation) forces that epoch to retry]
exploit available thread-level parallelism
Speculate
[diagram: epoch E1 stores *p while epoch E2 speculatively loads *q from memory]
good when p != q
Synchronize (and forward)
[diagram: epoch E2 waits (stalling) until epoch E1 stores *p and signals, then E2 loads *q]
good when p == q
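To make the two choices concrete, here is a hedged C++ sketch of the kind of loop behind these diagrams; the pointer names and the tls_wait/tls_signal primitives are illustrative stand-ins, not the talk's actual ISA extensions.

    // Illustrative stand-ins for compiler-inserted TLS primitives (assumed API,
    // stubbed out here; real hardware stalls and forwards between epochs).
    inline void tls_wait()   { /* stall until the predecessor epoch signals */ }
    inline void tls_signal() { /* allow the successor epoch to proceed */ }

    // Each iteration becomes an epoch; the store through p and the load through
    // q form a potential cross-epoch dependence.

    // SPECULATE: issue the load and let the hardware detect any conflict with
    // the previous epoch's store -- good when p != q.
    void kernel_speculative(int *p, int *q, int n) {
        for (int i = 0; i < n; ++i) {
            int v = *q;
            *p = v + i;
        }
    }

    // SYNCHRONIZE (and forward): when p == q almost always, the compiler
    // brackets the dependent accesses with wait/signal so the value is
    // forwarded instead of mis-speculated on.
    void kernel_synchronized(int *p, int *q, int n) {
        for (int i = 0; i < n; ++i) {
            tls_wait();      // stall until the previous epoch's store is visible
            int v = *q;
            *p = v + i;
            tls_signal();    // release the next epoch
        }
    }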
Overview
[diagram: under synchronization, the wait, load X, store X, signal chain is the critical forwarding path; a big critical path inflates execution time, a small one shrinks it]
reducing the critical forwarding path decreases execution time
Predict
[diagram: instead of speculating or synchronizing, epoch E2 obtains *q from a value predictor and proceeds; the prediction is verified against the value eventually stored by E1]
good when p == q and *q is predictable
Improving on Compile-Time Decisions
The compiler chooses statically, for each potential dependence, whether to speculate or synchronize. Hardware can improve on that decision at run time:
• where the compiler speculates: predict the value, or synchronize dynamically
• where the compiler synchronizes: predict the value, or reduce the critical forwarding path
Is there any potential benefit?
Potential for Improving Value Communication U=Un-optimized, P=Perfect Prediction (4 Processors) efficient value communication is key
Outline
• Our Support for Thread-Level Speculation
  • Compiler Support
  • Experimental Framework
  • Baseline Performance
• Techniques for Improving Value Communication
• Combining the Techniques
• Conclusions
Compiler Support (SUIF 1.3 and gcc)
1) Where to speculate
• use profile information, heuristics, loop unrolling
2) Transforming to exploit TLS
• insert new TLS-specific instructions
• synchronize/forward register values
3) Optimization
• eliminate dependences due to loop induction variables
• algorithm to schedule the critical forwarding path
the compiler plays a crucial role
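A hedged sketch of what step 2 and the critical-path scheduling in step 3 amount to for a loop-carried scalar; the wait/signal primitives, the loop, and the stub implementations are assumptions for illustration, not the compiler's actual output.

    // Illustrative stand-ins for the TLS forwarding instructions (assumed API);
    // the single-threaded stubs only make the sketch self-contained.
    static int forwarded_sum = 0;
    inline int  tls_wait_value()        { return forwarded_sum; }  // real HW stalls here
    inline void tls_signal_value(int v) { forwarded_sum = v; }     // real HW forwards here

    // Original loop: 'sum' is a loop-carried register value.
    //   for (i = 0; i < n; ++i) { sum += a[i]; other_work(i); }
    //
    // Transformed epoch body: the compiler forwards 'sum' through wait/signal
    // and schedules the signal as early as possible, so the critical forwarding
    // path (wait -> update of sum -> signal) stays short; other_work() does not
    // feed the forwarded value, so it is moved after the signal.
    void epoch_body(const int *a, int i, void (*other_work)(int)) {
        int sum = tls_wait_value();   // receive 'sum' from the previous epoch
        sum += a[i];                  // the only work on the critical forwarding path
        tls_signal_value(sum);        // release the next epoch immediately
        other_work(i);                // off the critical forwarding path
    }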
Experimental Framework
[diagram: four processors (P) with private caches (C) connected by a crossbar]
Benchmarks
• from SPECint95 and SPECint2000, -O3 optimization
Underlying architecture
• 4-processor, single-chip multiprocessor
• speculation supported by coherence
Simulator
• superscalar, similar to MIPS R10K
• models all bandwidth and contention
detailed simulation!
Compiler Performance S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized compiler optimization is effective
Outline
• Our Support for Thread-Level Speculation
• Techniques for Improving Value Communication
  • When Prediction is Best
    • Memory Value Prediction
    • Forwarded Value Prediction
    • Silent Stores
  • When Synchronization is Best
• Combining the Techniques
• Conclusions
Memory Value Prediction
[diagram: instead of epoch E2 loading *q from memory and risking a violation with E1's store to *p, E2 takes the value from a value predictor]
avoid failed speculation if *q is predictable
Value Predictor Configuration
Aggressive hybrid predictor
• 1K x 3-entry context predictor and 1K-entry stride predictor, indexed by load PC
• 2-bit, up/down, saturating confidence counters
[diagram: the load PC indexes both components; the predicted value comes from the more confident component, or no prediction is made if neither is confident]
predict only when confident
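A software sketch of such a hybrid predictor; the table sizes and the confidence policy follow the bullets above, but the class layout, hashing, and component selection are assumptions for illustration, not the paper's hardware design.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Hybrid stride + context value predictor with 2-bit saturating confidence.
    class HybridValuePredictor {
        struct StrideEntry  { uint64_t last = 0; int64_t stride = 0; uint8_t conf = 0; };
        struct ContextEntry { std::array<uint64_t, 3> history{}; uint8_t conf = 0; };

        std::array<StrideEntry, 1024>  stride_{};
        std::array<ContextEntry, 1024> context_{};
        std::array<uint64_t, 1024>     vht_{};      // history hash -> predicted value

        static size_t pc_index(uint64_t pc) { return (pc >> 2) & 1023; }
        static size_t hist_index(const std::array<uint64_t, 3> &h) {
            return (h[0] * 31 + h[1] * 7 + h[2]) & 1023;
        }
        static void raise(uint8_t &c) { if (c < 3) ++c; }
        static void lower(uint8_t &c) { if (c > 0) --c; }

    public:
        // Returns true and fills *value only when a component is confident
        // (otherwise: no prediction, matching "predict only when confident").
        bool predict(uint64_t pc, uint64_t *value) const {
            const StrideEntry  &s = stride_[pc_index(pc)];
            const ContextEntry &c = context_[pc_index(pc)];
            const bool s_ok = s.conf >= 2, c_ok = c.conf >= 2;
            if (!s_ok && !c_ok) return false;
            if (c_ok && (!s_ok || c.conf >= s.conf))            // the selector
                *value = vht_[hist_index(c.history)];
            else
                *value = s.last + static_cast<uint64_t>(s.stride);
            return true;
        }

        // Train with the value the load actually returned (only done for
        // successful epochs, per the talk).
        void update(uint64_t pc, uint64_t actual) {
            StrideEntry &s = stride_[pc_index(pc)];
            (s.last + static_cast<uint64_t>(s.stride) == actual) ? raise(s.conf) : lower(s.conf);
            s.stride = static_cast<int64_t>(actual - s.last);
            s.last   = actual;

            ContextEntry &c = context_[pc_index(pc)];
            const size_t h = hist_index(c.history);
            (vht_[h] == actual) ? raise(c.conf) : lower(c.conf);
            vht_[h] = actual;
            c.history = { c.history[1], c.history[2], actual };
        }
    };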
Throttling Prediction
Only predict exposed loads
• hardware tracks which words are speculatively modified within the epoch
• use this to determine whether a load is exposed
[diagram: within an epoch, a load of X that follows the epoch's own store to X is not exposed; a load of X with no earlier store to X in the epoch is exposed]
predict only exposed loads
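A minimal sketch of the exposed-load test, assuming word-granularity tracking of the current epoch's speculative modifications; the set-based bookkeeping is illustrative, not the actual cache mechanism.

    #include <cstdint>
    #include <unordered_set>

    // Tracks which words the current epoch has speculatively modified; a load
    // is "exposed" (and therefore a prediction candidate) only if the epoch has
    // not already written that word itself.
    class EpochWriteTracker {
        std::unordered_set<uint64_t> written_words_;
        static uint64_t word_of(uint64_t addr) { return addr >> 3; }  // 8-byte words (assumed)

    public:
        void on_store(uint64_t addr)              { written_words_.insert(word_of(addr)); }
        bool load_is_exposed(uint64_t addr) const { return !written_words_.count(word_of(addr)); }
        void on_epoch_end()                       { written_words_.clear(); }
    };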
Memory Value Prediction exposed loads are fairly predictable
Memory Value Prediction B=Baseline, E=Predict Exposed Lds, V=Predict Violating Loads effective if properly throttled
Forwarded Value Prediction
[diagram: instead of epoch E2 waiting (stalling) for E1 to store X and signal, E2 takes X from a value predictor and proceeds]
avoid synchronization stall if X is predictable
Forwarded Value Prediction forwarded values are also fairly predictable
Forwarded Value Prediction B=Baseline, F=Predict Forwarded Vals, S=Predict Stalling Vals only predict loads that have caused stalls
Exploiting Silent Stores
A store is silent when it writes the value already in memory (e.g., storing X=5 when X is already 5). The store is converted into a load and a comparison; if the values match, the store is squashed, so a later epoch's load of X no longer causes a violation.
avoid failed speculation if the store is silent
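A hedged sketch of the silent-store check applied before a speculative store; the memory interface is a plain pointer here, standing in for the speculative cache.

    #include <cstdint>

    // Before performing a speculative store, read the current value; if the
    // store would write the same value it is "silent" and can be squashed, so
    // it never creates a cross-epoch dependence for later readers of this word.
    // Returns true if a real (non-silent) store was performed.
    bool speculative_store(uint64_t *addr, uint64_t value) {
        if (*addr == value) {
            return false;      // silent store: squash it, no new speculative state
        }
        *addr = value;         // non-silent: perform the speculative store as usual
        return true;
    }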
Silent Stores silent stores are prevalent
Impact of Exploiting Silent Stores B=Baseline, SS=Exploit Silent Stores most of the benefits of memory value prediction
Outline
• Our Support for Thread-Level Speculation
• Techniques for Improving Value Communication
  • When Prediction is Best
  • When Synchronization is Best
    • Hardware-Inserted Dynamic Synchronization
    • Reducing the Critical Forwarding Path
• Combining the Techniques
• Conclusions
Hardware-Inserted Dynamic Synchronization
[diagram: instead of epoch E2 speculatively loading *q and being violated by E1's store to *p, the hardware stalls the load until the store has completed]
avoid failed speculation
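A minimal sketch of the dynamic-synchronization decision, assuming the hardware keeps a small record of load PCs that have recently caused violations; the reset and minimum policies hinted at in the results (the R and M configurations) are not modeled here.

    #include <cstdint>
    #include <unordered_set>

    // Loads whose PCs have previously caused dependence violations are stalled
    // (synchronized dynamically) instead of being issued speculatively.
    class DynamicSync {
        std::unordered_set<uint64_t> violating_load_pcs_;

    public:
        void on_violation(uint64_t load_pc) { violating_load_pcs_.insert(load_pc); }

        // true -> stall this load until the predecessor epoch's stores are
        // visible; false -> issue it speculatively as usual.
        bool should_stall(uint64_t load_pc) const {
            return violating_load_pcs_.count(load_pc) != 0;
        }
    };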
Hardware-Inserted Dynamic Synchronization B=Baseline, D=Sync. Violating Lds, R=D+Reset, M=R+Minimum overall average improvement of 9%
Overview
[diagram: under synchronization, the wait, load X, store X, signal chain is the critical forwarding path; a big critical path inflates execution time, a small one shrinks it]
reducing the critical forwarding path decreases execution time
Prioritizing the Critical Forwarding Path
• mark the input chain of the critical store
• give marked instructions high issue priority
Example: the critical path is Load r1=X; op r2=r1,r3; Store r2,X; Signal. Without prioritization, independent work (op r5=r6,r7; op r6=r5,r8) may issue first and lengthen the forwarding path; with prioritization, the marked chain issues first and the independent ops follow.
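A hedged sketch of how the input chain (backward slice) of the critical store could be marked; the toy instruction representation is invented for illustration and is not the compiler's IR.

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Toy instruction: one destination register and up to two source registers.
    struct Inst {
        int dst;                    // destination register (-1 if none, e.g. a store)
        int src[2];                 // source registers (-1 if unused)
        bool critical = false;      // set for instructions on the forwarding path
    };

    // Walk backward from the critical store, marking every instruction that
    // produces a value the store (transitively) depends on. Marked instructions
    // would then be given high issue priority.
    void mark_critical_chain(std::vector<Inst> &insts, size_t critical_store) {
        std::unordered_set<int> needed;              // registers the chain still needs
        insts[critical_store].critical = true;
        for (int s : insts[critical_store].src)
            if (s >= 0) needed.insert(s);

        for (size_t i = critical_store; i-- > 0; ) { // scan earlier instructions
            Inst &in = insts[i];
            if (in.dst >= 0 && needed.count(in.dst)) {   // produces a needed value
                in.critical = true;
                needed.erase(in.dst);
                for (int s : in.src)
                    if (s >= 0) needed.insert(s);
            }
        }
    }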
Critical Path Prioritization some reordering
Impact of Prioritizing the Critical Path B=Baseline, S=Prioritizing Critical Path not much benefit, given the complexity
Outline
• Our Support for Thread-Level Speculation
• Techniques for Improving Value Communication
• Combining the Techniques
• Conclusions
Combining the Techniques
Techniques are orthogonal, with one exception: memory value prediction and dynamic synchronization.
• only synchronize memory values that are unpredictable
• the dynamic synchronization logic checks prediction confidence
• synchronize only if the predictor is not confident
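A hedged sketch of how the two mechanisms could be combined per load; the enum and function are invented for this example and simply encode the confidence check described above.

    // What to do with a load that dynamic synchronization would otherwise stall.
    enum class LoadAction { Speculate, Predict, Synchronize };

    // A load that has caused violations is synchronized only when the value
    // predictor is not confident; otherwise its value is predicted and the
    // epoch keeps running.
    LoadAction choose_action(bool has_caused_violation, bool predictor_confident) {
        if (!has_caused_violation) return LoadAction::Speculate;   // normal TLS path
        if (predictor_confident)   return LoadAction::Predict;     // avoid the stall
        return LoadAction::Synchronize;                            // last resort: stall
    }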
Combining the Techniques B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction close to ideal for m88ksim and vpr
Conclusions
Prediction
• memory value prediction: effective when throttled
• forwarded value prediction: effective when throttled
• silent stores: prevalent and effective
Synchronization
• dynamic synchronization: can help or hurt
• hardware prioritization: ineffective if the compiler is good
prediction is effective; synchronization has mixed results
Goals
1) Parallelize general-purpose programs
• difficult problem
2) Keep hardware support simple and minimal
• avoid large, specialized structures
• preserve the performance of non-TLS workloads
3) Take full advantage of the compiler
• region selection, synchronization, optimization
When Prediction is Best
Predicting under TLS
• only update the predictor for successful epochs
• the cost of misprediction is high: the epoch must be re-executed
• each epoch requires a logically-separate predictor
Differentiation from previous work:
• loop induction variables are already optimized by the compiler
• larger regions of code, hence a larger number of memory dependences between epochs
Memory Value Prediction exposed loads are quite predictable
Throttling Prediction Further
On an exposed load: record the load's PC in an exposed load table, indexed by cache tag.
On a dependence violation: move the load PCs recorded for the violating line from the exposed load table to a violating loads list.
only predict violating loads
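A hedged sketch of this bookkeeping, assuming the exposed load table is indexed by cache line tag; the container choices are illustrative, not the hardware organization.

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Exposed load table: cache tag -> PCs of exposed loads to that line.
    // Violating loads list: PCs that should be value-predicted from now on.
    class LoadThrottle {
        std::unordered_map<uint64_t, std::vector<uint64_t>> exposed_;  // tag -> load PCs
        std::unordered_set<uint64_t> violating_;

        static uint64_t tag_of(uint64_t addr) { return addr >> 6; }    // 64-byte lines (assumed)

    public:
        // Called for every exposed load in the current epoch.
        void on_exposed_load(uint64_t load_pc, uint64_t addr) {
            exposed_[tag_of(addr)].push_back(load_pc);
        }

        // Called when a dependence violation is detected on a line: promote the
        // loads recorded for that line to the violating loads list.
        void on_violation(uint64_t addr) {
            auto it = exposed_.find(tag_of(addr));
            if (it == exposed_.end()) return;
            for (uint64_t pc : it->second) violating_.insert(pc);
            exposed_.erase(it);
        }

        // Only loads on the violating list are handed to the value predictor.
        bool should_predict(uint64_t load_pc) const { return violating_.count(load_pc) != 0; }
    };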
Forwarded Value Prediction synchronized loads are also predictable
Silent Stores silent stores are prevalent
Critical Path Prioritization significant reordering