Improving Value Communication for Thread-Level Speculation
Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry
School of Computer Science, Carnegie Mellon University
Multithreaded Machines Are Everywhere
• ALPHA 21464, Intel Xeon
• SUN MAJC, IBM Power4, SiByte SB-1250
How can we use them? Parallelism!
Automatic Parallelization
Proving independence of threads is hard:
• complex control flow
• complex data structures
• pointers, pointers, pointers
• run-time inputs
How can we make the compiler's job feasible? Thread-Level Speculation (TLS)
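As a concrete illustration (a minimal sketch, not from the talk), here is the kind of loop a compiler cannot safely parallelize on its own: whether iterations are independent depends on whether the pointers alias, which may only be known at run time. The function name is invented for this example.

    // The compiler cannot parallelize this loop across iterations unless it
    // proves that p never points into q[], which may depend on run-time input.
    void accumulate(int *p, int *q, int n) {
        for (int i = 0; i < n; ++i) {
            *p += q[i];   // store through p; if p points into q[], a later
                          // iteration's q[i] depends on this store
        }
    }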
Thread-Level Speculation
[diagram: epochs E1, E2, E3 run in parallel over time; a store in an earlier epoch that conflicts with a load already performed by a later epoch (a TLS violation) forces that epoch to retry]
exploit available thread-level parallelism
Speculate
[diagram: epoch E1 stores *p while epoch E2 speculatively loads *q from memory]
good when p != q
Synchronize (and forward)
[diagram: epoch E2 waits (stalling) until epoch E1 stores *p and signals, then E2 loads *q]
good when p == q
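To make the two choices concrete, here is a hedged C++ sketch of the kind of loop behind these diagrams; the pointer names and the tls_wait/tls_signal primitives are illustrative stand-ins, not the talk's actual ISA extensions.

    // Illustrative stand-ins for compiler-inserted TLS primitives (assumed API,
    // stubbed out here; real hardware stalls and forwards between epochs).
    inline void tls_wait()   { /* stall until the predecessor epoch signals */ }
    inline void tls_signal() { /* allow the successor epoch to proceed */ }

    // Each iteration becomes an epoch; the store through p and the load through
    // q form a potential cross-epoch dependence.

    // SPECULATE: issue the load and let the hardware detect any conflict with
    // the previous epoch's store -- good when p != q.
    void kernel_speculative(int *p, int *q, int n) {
        for (int i = 0; i < n; ++i) {
            int v = *q;
            *p = v + i;
        }
    }

    // SYNCHRONIZE (and forward): when p == q almost always, the compiler
    // brackets the dependent accesses with wait/signal so the value is
    // forwarded instead of mis-speculated on.
    void kernel_synchronized(int *p, int *q, int n) {
        for (int i = 0; i < n; ++i) {
            tls_wait();      // stall until the previous epoch's store is visible
            int v = *q;
            *p = v + i;
            tls_signal();    // release the next epoch
        }
    }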
Overview
[diagram: under synchronization, the wait, load X, store X, signal chain is the critical forwarding path; a big critical path inflates execution time, a small one shrinks it]
reducing the critical forwarding path decreases execution time
Predict
[diagram: instead of speculating or synchronizing, epoch E2 obtains *q from a value predictor and proceeds; the prediction is verified against the value eventually stored by E1]
good when p == q and *q is predictable
Improving on Compile-Time Decisions
The compiler chooses statically, for each potential dependence, whether to speculate or synchronize. Hardware can improve on that decision at run time:
• where the compiler speculates: predict the value, or synchronize dynamically
• where the compiler synchronizes: predict the value, or reduce the critical forwarding path
Is there any potential benefit?
Potential for Improving Value Communication U=Un-optimized, P=Perfect Prediction (4 Processors) efficient value communication is key
Outline
• Our Support for Thread-Level Speculation
  • Compiler Support
  • Experimental Framework
  • Baseline Performance
• Techniques for Improving Value Communication
• Combining the Techniques
• Conclusions
Compiler Support (SUIF 1.3 and gcc)
1) Where to speculate
• use profile information, heuristics, loop unrolling
2) Transforming to exploit TLS
• insert new TLS-specific instructions
• synchronize/forward register values
3) Optimization
• eliminate dependences due to loop induction variables
• algorithm to schedule the critical forwarding path
the compiler plays a crucial role
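A hedged sketch of what step 2 and the critical-path scheduling in step 3 amount to for a loop-carried scalar; the wait/signal primitives, the loop, and the stub implementations are assumptions for illustration, not the compiler's actual output.

    // Illustrative stand-ins for the TLS forwarding instructions (assumed API);
    // the single-threaded stubs only make the sketch self-contained.
    static int forwarded_sum = 0;
    inline int  tls_wait_value()        { return forwarded_sum; }  // real HW stalls here
    inline void tls_signal_value(int v) { forwarded_sum = v; }     // real HW forwards here

    // Original loop: 'sum' is a loop-carried register value.
    //   for (i = 0; i < n; ++i) { sum += a[i]; other_work(i); }
    //
    // Transformed epoch body: the compiler forwards 'sum' through wait/signal
    // and schedules the signal as early as possible, so the critical forwarding
    // path (wait -> update of sum -> signal) stays short; other_work() does not
    // feed the forwarded value, so it is moved after the signal.
    void epoch_body(const int *a, int i, void (*other_work)(int)) {
        int sum = tls_wait_value();   // receive 'sum' from the previous epoch
        sum += a[i];                  // the only work on the critical forwarding path
        tls_signal_value(sum);        // release the next epoch immediately
        other_work(i);                // off the critical forwarding path
    }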
Experimental Framework
[diagram: four processors (P) with private caches (C) connected by a crossbar]
Benchmarks
• from SPECint95 and SPECint2000, -O3 optimization
Underlying architecture
• 4-processor, single-chip multiprocessor
• speculation supported by coherence
Simulator
• superscalar, similar to MIPS R10K
• models all bandwidth and contention
detailed simulation!
Compiler Performance S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized compiler optimization is effective
Outline
• Our Support for Thread-Level Speculation
• Techniques for Improving Value Communication
  • When Prediction is Best
    • Memory Value Prediction
    • Forwarded Value Prediction
    • Silent Stores
  • When Synchronization is Best
• Combining the Techniques
• Conclusions
Memory Value Prediction
[diagram: instead of epoch E2 loading *q from memory and risking a violation with E1's store to *p, E2 takes the value from a value predictor]
avoid failed speculation if *q is predictable
Value Predictor Configuration
Aggressive hybrid predictor
• 1K x 3-entry context predictor and 1K-entry stride predictor, indexed by load PC
• 2-bit, up/down, saturating confidence counters
[diagram: the load PC indexes both components; the predicted value comes from the more confident component, or no prediction is made if neither is confident]
predict only when confident
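A software sketch of such a hybrid predictor; the table sizes and the confidence policy follow the bullets above, but the class layout, hashing, and component selection are assumptions for illustration, not the paper's hardware design.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Hybrid stride + context value predictor with 2-bit saturating confidence.
    class HybridValuePredictor {
        struct StrideEntry  { uint64_t last = 0; int64_t stride = 0; uint8_t conf = 0; };
        struct ContextEntry { std::array<uint64_t, 3> history{}; uint8_t conf = 0; };

        std::array<StrideEntry, 1024>  stride_{};
        std::array<ContextEntry, 1024> context_{};
        std::array<uint64_t, 1024>     vht_{};      // history hash -> predicted value

        static size_t pc_index(uint64_t pc) { return (pc >> 2) & 1023; }
        static size_t hist_index(const std::array<uint64_t, 3> &h) {
            return (h[0] * 31 + h[1] * 7 + h[2]) & 1023;
        }
        static void raise(uint8_t &c) { if (c < 3) ++c; }
        static void lower(uint8_t &c) { if (c > 0) --c; }

    public:
        // Returns true and fills *value only when a component is confident
        // (otherwise: no prediction, matching "predict only when confident").
        bool predict(uint64_t pc, uint64_t *value) const {
            const StrideEntry  &s = stride_[pc_index(pc)];
            const ContextEntry &c = context_[pc_index(pc)];
            const bool s_ok = s.conf >= 2, c_ok = c.conf >= 2;
            if (!s_ok && !c_ok) return false;
            if (c_ok && (!s_ok || c.conf >= s.conf))            // the selector
                *value = vht_[hist_index(c.history)];
            else
                *value = s.last + static_cast<uint64_t>(s.stride);
            return true;
        }

        // Train with the value the load actually returned (only done for
        // successful epochs, per the talk).
        void update(uint64_t pc, uint64_t actual) {
            StrideEntry &s = stride_[pc_index(pc)];
            (s.last + static_cast<uint64_t>(s.stride) == actual) ? raise(s.conf) : lower(s.conf);
            s.stride = static_cast<int64_t>(actual - s.last);
            s.last   = actual;

            ContextEntry &c = context_[pc_index(pc)];
            const size_t h = hist_index(c.history);
            (vht_[h] == actual) ? raise(c.conf) : lower(c.conf);
            vht_[h] = actual;
            c.history = { c.history[1], c.history[2], actual };
        }
    };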
Throttling Prediction
Only predict exposed loads
• hardware tracks which words are speculatively modified within the epoch
• use this to determine whether a load is exposed
[diagram: within an epoch, a load of X that follows the epoch's own store to X is not exposed; a load of X with no earlier store to X in the epoch is exposed]
predict only exposed loads
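A minimal sketch of the exposed-load test, assuming word-granularity tracking of the current epoch's speculative modifications; the set-based bookkeeping is illustrative, not the actual cache mechanism.

    #include <cstdint>
    #include <unordered_set>

    // Tracks which words the current epoch has speculatively modified; a load
    // is "exposed" (and therefore a prediction candidate) only if the epoch has
    // not already written that word itself.
    class EpochWriteTracker {
        std::unordered_set<uint64_t> written_words_;
        static uint64_t word_of(uint64_t addr) { return addr >> 3; }  // 8-byte words (assumed)

    public:
        void on_store(uint64_t addr)              { written_words_.insert(word_of(addr)); }
        bool load_is_exposed(uint64_t addr) const { return !written_words_.count(word_of(addr)); }
        void on_epoch_end()                       { written_words_.clear(); }
    };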
Memory Value Prediction exposed loads are fairly predictable
Memory Value Prediction B=Baseline, E=Predict Exposed Lds, V=Predict Violating Loads effective if properly throttled
Forwarded Value Prediction
[diagram: instead of epoch E2 waiting (stalling) for E1 to store X and signal, E2 takes X from a value predictor and proceeds]
avoid synchronization stall if X is predictable
Forwarded Value Prediction forwarded values are also fairly predictable
Forwarded Value Prediction B=Baseline, F=Predict Forwarded Vals, S=Predict Stalling Vals only predict loads that have caused stalls
Exploiting Silent Stores
A store is silent when it writes the value already in memory (e.g., storing X=5 when X is already 5). The store is converted into a load and a comparison; if the values match, the store is squashed, so a later epoch's load of X no longer causes a violation.
avoid failed speculation if the store is silent
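A hedged sketch of the silent-store check applied before a speculative store; the memory interface is a plain pointer here, standing in for the speculative cache.

    #include <cstdint>

    // Before performing a speculative store, read the current value; if the
    // store would write the same value it is "silent" and can be squashed, so
    // it never creates a cross-epoch dependence for later readers of this word.
    // Returns true if a real (non-silent) store was performed.
    bool speculative_store(uint64_t *addr, uint64_t value) {
        if (*addr == value) {
            return false;      // silent store: squash it, no new speculative state
        }
        *addr = value;         // non-silent: perform the speculative store as usual
        return true;
    }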
Silent Stores silent stores are prevalent
Impact of Exploiting Silent Stores B=Baseline, SS=Exploit Silent Stores most of the benefits of memory value prediction
Outline
• Our Support for Thread-Level Speculation
• Techniques for Improving Value Communication
  • When Prediction is Best
  • When Synchronization is Best
    • Hardware-Inserted Dynamic Synchronization
    • Reducing the Critical Forwarding Path
• Combining the Techniques
• Conclusions
Hardware-Inserted Dynamic Synchronization
[diagram: instead of epoch E2 speculatively loading *q and being violated by E1's store to *p, the hardware stalls the load until the store has completed]
avoid failed speculation
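A minimal sketch of the dynamic-synchronization decision, assuming the hardware keeps a small record of load PCs that have recently caused violations; the reset and minimum policies hinted at in the results (the R and M configurations) are not modeled here.

    #include <cstdint>
    #include <unordered_set>

    // Loads whose PCs have previously caused dependence violations are stalled
    // (synchronized dynamically) instead of being issued speculatively.
    class DynamicSync {
        std::unordered_set<uint64_t> violating_load_pcs_;

    public:
        void on_violation(uint64_t load_pc) { violating_load_pcs_.insert(load_pc); }

        // true -> stall this load until the predecessor epoch's stores are
        // visible; false -> issue it speculatively as usual.
        bool should_stall(uint64_t load_pc) const {
            return violating_load_pcs_.count(load_pc) != 0;
        }
    };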
Hardware-Inserted Dynamic Synchronization B=Baseline, D=Sync. Violating Lds, R=D+Reset, M=R+Minimum overall average improvement of 9%
Overview
[diagram: under synchronization, the wait, load X, store X, signal chain is the critical forwarding path; a big critical path inflates execution time, a small one shrinks it]
reducing the critical forwarding path decreases execution time
Prioritizing the Critical Forwarding Path
• mark the input chain of the critical store
• give marked instructions high issue priority
Example: the critical path is Load r1=X; op r2=r1,r3; Store r2,X; Signal. Without prioritization, independent work (op r5=r6,r7; op r6=r5,r8) may issue first and lengthen the forwarding path; with prioritization, the marked chain issues first and the independent ops follow.
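A hedged sketch of how the input chain (backward slice) of the critical store could be marked; the toy instruction representation is invented for illustration and is not the compiler's IR.

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Toy instruction: one destination register and up to two source registers.
    struct Inst {
        int dst;                    // destination register (-1 if none, e.g. a store)
        int src[2];                 // source registers (-1 if unused)
        bool critical = false;      // set for instructions on the forwarding path
    };

    // Walk backward from the critical store, marking every instruction that
    // produces a value the store (transitively) depends on. Marked instructions
    // would then be given high issue priority.
    void mark_critical_chain(std::vector<Inst> &insts, size_t critical_store) {
        std::unordered_set<int> needed;              // registers the chain still needs
        insts[critical_store].critical = true;
        for (int s : insts[critical_store].src)
            if (s >= 0) needed.insert(s);

        for (size_t i = critical_store; i-- > 0; ) { // scan earlier instructions
            Inst &in = insts[i];
            if (in.dst >= 0 && needed.count(in.dst)) {   // produces a needed value
                in.critical = true;
                needed.erase(in.dst);
                for (int s : in.src)
                    if (s >= 0) needed.insert(s);
            }
        }
    }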
Critical Path Prioritization some reordering
Impact of Prioritizing the Critical Path B=Baseline, S=Prioritizing Critical Path not much benefit, given the complexity
Outline
• Our Support for Thread-Level Speculation
• Techniques for Improving Value Communication
• Combining the Techniques
• Conclusions
Combining the Techniques
Techniques are orthogonal, with one exception: memory value prediction and dynamic synchronization.
• only synchronize memory values that are unpredictable
• the dynamic synchronization logic checks prediction confidence
• synchronize only if the predictor is not confident
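A hedged sketch of how the two mechanisms could be combined per load; the enum and function are invented for this example and simply encode the confidence check described above.

    // What to do with a load that dynamic synchronization would otherwise stall.
    enum class LoadAction { Speculate, Predict, Synchronize };

    // A load that has caused violations is synchronized only when the value
    // predictor is not confident; otherwise its value is predicted and the
    // epoch keeps running.
    LoadAction choose_action(bool has_caused_violation, bool predictor_confident) {
        if (!has_caused_violation) return LoadAction::Speculate;   // normal TLS path
        if (predictor_confident)   return LoadAction::Predict;     // avoid the stall
        return LoadAction::Synchronize;                            // last resort: stall
    }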
Combining the Techniques B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction close to ideal for m88ksim and vpr
Conclusions
Prediction
• memory value prediction: effective when throttled
• forwarded value prediction: effective when throttled
• silent stores: prevalent and effective
Synchronization
• dynamic synchronization: can help or hurt
• hardware prioritization: ineffective if the compiler is good
prediction is effective; synchronization has mixed results
Goals
1) Parallelize general-purpose programs
• difficult problem
2) Keep hardware support simple and minimal
• avoid large, specialized structures
• preserve the performance of non-TLS workloads
3) Take full advantage of the compiler
• region selection, synchronization, optimization
When Prediction is Best
Predicting under TLS
• only update the predictor for successful epochs
• the cost of misprediction is high: the epoch must be re-executed
• each epoch requires a logically-separate predictor
Differentiation from previous work:
• loop induction variables are already optimized by the compiler
• larger regions of code, hence a larger number of memory dependences between epochs
Memory Value Prediction exposed loads are quite predictable
Throttling Prediction Further
On an exposed load: record the load's PC in an exposed load table, indexed by cache tag.
On a dependence violation: move the load PCs recorded for the violating line from the exposed load table to a violating loads list.
only predict violating loads
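A hedged sketch of this bookkeeping, assuming the exposed load table is indexed by cache line tag; the container choices are illustrative, not the hardware organization.

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Exposed load table: cache tag -> PCs of exposed loads to that line.
    // Violating loads list: PCs that should be value-predicted from now on.
    class LoadThrottle {
        std::unordered_map<uint64_t, std::vector<uint64_t>> exposed_;  // tag -> load PCs
        std::unordered_set<uint64_t> violating_;

        static uint64_t tag_of(uint64_t addr) { return addr >> 6; }    // 64-byte lines (assumed)

    public:
        // Called for every exposed load in the current epoch.
        void on_exposed_load(uint64_t load_pc, uint64_t addr) {
            exposed_[tag_of(addr)].push_back(load_pc);
        }

        // Called when a dependence violation is detected on a line: promote the
        // loads recorded for that line to the violating loads list.
        void on_violation(uint64_t addr) {
            auto it = exposed_.find(tag_of(addr));
            if (it == exposed_.end()) return;
            for (uint64_t pc : it->second) violating_.insert(pc);
            exposed_.erase(it);
        }

        // Only loads on the violating list are handed to the value predictor.
        bool should_predict(uint64_t load_pc) const { return violating_.count(load_pc) != 0; }
    };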
Forwarded Value Prediction synchronized loads are also predictable
Silent Stores silent stores are prevalent
Critical Path Prioritization significant reordering