ICCD´02, Freiburg (Germany) - September 16-18, 2002 Trace-Level Speculative Multithreaded Architecture Carlos Molina, Universitat Rovira i Virgili – Tarragona, Spain, cmolina@etse.urv.es Antonio González and Jordi Tubella, Universitat Politècnica de Catalunya – Barcelona, Spain, {antonio,jordit}@ac.upc.es
Outline • Motivation • Related Work • TSMA • Performance Results • Conclusions
Motivation • Two techniques to avoid the serialization caused by data dependences • Data Value Speculation • Data Value Reuse • Speculation predicts values based on past behavior • Reuse is possible if the computation has already been performed in the past • Both may be considered at two levels • Instruction Level • Trace Level
Trace Level Reuse (Static / Dynamic) • A set of instructions can be skipped in a row • These instructions do not need to be fetched • The live-input test is not easy to handle (see the sketch below)
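To make the live-input test concrete, here is a minimal C sketch of dynamic trace-level reuse; the reuse_entry_t layout, the field names, and the register-only view of live inputs are illustrative assumptions, not the mechanism of any particular proposal.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_LIVE 4                       /* illustrative bound on live-ins/outs */

    typedef struct {
        uint64_t start_pc;                   /* first instruction of the trace      */
        uint64_t next_pc;                    /* instruction following the trace     */
        int      n_in, n_out;
        int      in_reg[MAX_LIVE];  uint64_t in_val[MAX_LIVE];
        int      out_reg[MAX_LIVE]; uint64_t out_val[MAX_LIVE];
    } reuse_entry_t;

    /* Live-input test: the trace may be skipped only if every recorded
     * live-input register still holds the value it had when the trace was
     * first executed; extending this test to memory is what makes it hard
     * to handle, as noted above. */
    static bool try_reuse(const reuse_entry_t *e, uint64_t regs[], uint64_t *pc)
    {
        for (int i = 0; i < e->n_in; i++)
            if (regs[e->in_reg[i]] != e->in_val[i])
                return false;                     /* live-input test failed  */

        for (int i = 0; i < e->n_out; i++)        /* skip the trace: write   */
            regs[e->out_reg[i]] = e->out_val[i];  /* its live-output values  */
        *pc = e->next_pc;                         /* and jump past it        */
        return true;
    }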
Trace Level Speculation (With Live Input Test / With Live Output Test) • Solves the live-input test • Introduces penalties due to misspeculations • Two orthogonal issues • microarchitecture support for trace speculation • control and data speculation techniques • prediction of initial and final points • prediction of live-output values
Trace Level Speculation with Live Input Test (figure: the ST performs instruction execution with live-output actualization & trace speculation, leaving the speculated trace not executed; the NST performs live-input validation & instruction execution; a miss trace speculation triggers detection & recovery actions)
Trace Level Speculation with Live Output Test (figure: the ST performs instruction execution with live-output actualization & trace speculation, leaving the speculated trace not executed and buffering its instructions; the NST performs live-output validation; a miss trace speculation triggers detection & recovery actions)
Related Work • Trace Level Reuse • Basic blocks (Huang and Lilja, 99) • General traces (González et al., 99) • Traces with compiler support (Connors and Hwu, 99) • Trace Level Speculation • DIVA (Austin, 99) • Slipstream processors (Rotenberg et al., 99) • Pre-execution (Sohi et al., 01) • Precomputation (Shen et al., 01) • Nearby and distant ILP (Balasubramonian et al., 01)
TSMA (block diagram: Fetch Engine with Branch Predictor and I-Cache; Decode & Rename; separate ST and NST instruction windows, load/store queues and reorder buffers; shared Functional Units; ST and NST architectural register files; Trace Speculation Engine; Verification Engine; Look Ahead Buffer; speculative L1 data cache (L1SDC) plus non-speculative L1NSDC and L2NSDC)
Trace Speculation Engine • Handles two issues • implementing a trace-level predictor • communicating the trace speculation opportunity • Trace-level predictor (sketched below) • PC-indexed table with N entries • Each entry contains • live-output values • final program counter of the trace • Trace speculation communication • INI_TRACE instruction • Additional MOVE instructions
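A minimal sketch of how the PC-indexed trace-level predictor could be organized; the table size, entry layout, and function names are assumptions for illustration, not the exact hardware organization of the paper.

    #include <stdint.h>
    #include <stdbool.h>

    #define PRED_ENTRIES 1024                /* N entries (size is illustrative)     */
    #define MAX_LIVE_OUT 8                   /* bound on predicted live outputs      */

    typedef struct {
        uint64_t tag;                        /* PC of the initial point of the trace */
        uint64_t final_pc;                   /* predicted final program counter      */
        int      n_out;
        int      out_reg[MAX_LIVE_OUT];      /* predicted live-output registers      */
        uint64_t out_val[MAX_LIVE_OUT];      /* ...and their predicted values        */
    } trace_pred_entry_t;

    static trace_pred_entry_t pred_table[PRED_ENTRIES];

    /* PC-indexed lookup; on a hit the engine would communicate the speculation
     * opportunity (conceptually, an INI_TRACE instruction plus additional MOVE
     * instructions that install the predicted live-output values). */
    static bool predict_trace(uint64_t pc, const trace_pred_entry_t **entry)
    {
        const trace_pred_entry_t *e = &pred_table[pc % PRED_ENTRIES];
        if (e->tag != pc)
            return false;
        *entry = e;
        return true;
    }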
Look Ahead Buffer • First-in, first-out (FIFO) queue • Stores instructions executed by the ST • The fields of each entry are (see the sketch below): • Program Counter • Operation Type: indicates a memory operation • Source register Id 1 & source value 1 • Source register Id 2 & source value 2 • Destination register Id & destination value • Memory address
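The entry format maps naturally onto a small circular FIFO; the sketch below uses assumed field widths and an assumed capacity.

    #include <stdint.h>
    #include <stdbool.h>

    #define LAB_SIZE 128                     /* illustrative capacity */

    typedef struct {
        uint64_t pc;                         /* Program Counter                     */
        bool     is_mem;                     /* Operation Type: memory operation?   */
        int      src1_reg;  uint64_t src1_val;   /* source register id 1 & value 1  */
        int      src2_reg;  uint64_t src2_val;   /* source register id 2 & value 2  */
        int      dst_reg;   uint64_t dst_val;    /* destination register id & value */
        uint64_t mem_addr;                   /* effective address (memory ops only) */
    } lab_entry_t;

    typedef struct {                         /* simple circular FIFO */
        lab_entry_t entries[LAB_SIZE];
        int head, tail, count;
    } lab_t;

    static bool lab_push(lab_t *lab, lab_entry_t e)   /* ST enqueues its results */
    {
        if (lab->count == LAB_SIZE) return false;     /* LAB full                */
        lab->entries[lab->tail] = e;
        lab->tail = (lab->tail + 1) % LAB_SIZE;
        lab->count++;
        return true;
    }

    static bool lab_pop(lab_t *lab, lab_entry_t *e)   /* VE dequeues the oldest  */
    {
        if (lab->count == 0) return false;
        *e = lab->entries[lab->head];
        lab->head = (lab->head + 1) % LAB_SIZE;
        lab->count--;
        return true;
    }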
Verification Engine • Validates speculated instructions • Maintains the non-speculative state • Consumes instructions from the LAB • The test is performed as follows (sketched below): • the source values of each instruction are tested against the non-speculative state • if they match, the destination value of the instruction may be committed • memory operations also check the effective address • store instructions update memory, the rest update registers • The hardware required is minimal
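A sketch of the verification test, reusing the lab_entry_t layout from the previous sketch; the negative-register-id convention and the flat nst_mem view of memory are illustrative simplifications, not the paper's implementation.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    /* A negative register id means "not used"; a store is assumed to carry
     * its data in dst_val with dst_reg < 0. */
    static bool verify_instruction(const lab_entry_t *e,
                                   uint64_t nst_regs[], uint8_t *nst_mem)
    {
        /* Source values must match the current non-speculative state. */
        if (e->src1_reg >= 0 && nst_regs[e->src1_reg] != e->src1_val)
            return false;
        if (e->src2_reg >= 0 && nst_regs[e->src2_reg] != e->src2_val)
            return false;

        if (e->is_mem) {
            /* Memory operations additionally check the effective address
             * (recomputation from the validated sources is omitted here). */
            if (e->dst_reg < 0)              /* store: update memory       */
                memcpy(nst_mem + e->mem_addr, &e->dst_val, sizeof e->dst_val);
            else                             /* load: update a register    */
                nst_regs[e->dst_reg] = e->dst_val;
        } else if (e->dst_reg >= 0) {
            nst_regs[e->dst_reg] = e->dst_val;   /* the rest update registers */
        }
        return true;                         /* non-speculative state advanced */
    }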
Thread Synchronization • Handles trace mispredictions • The recovery actions involved are (sketched below): • instruction execution is stopped • ST structures are emptied (IW, LSQ, ROB, LAB) • the speculative cache and the ST register file are invalidated • Two types of synchronization • Total (occurs when the NST is not executing instructions) • penalty due to refilling the pipeline • Partial (occurs when the NST is executing instructions) • no penalty • the NST takes the role of the ST
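A sketch of how the recovery actions and the two synchronization types could be sequenced; every function called here is a hypothetical hook, named only for illustration.

    #include <stdbool.h>

    /* Hypothetical hooks into the pipeline; only the sequencing matters here. */
    void st_stop_execution(void);
    void st_flush_structures(void);          /* empties IW, LSQ, ROB and LAB   */
    void invalidate_l1sdc(void);
    void invalidate_st_register_file(void);
    void swap_thread_roles(void);            /* NST becomes the new ST         */
    void restart_from_nonspeculative_state(void);

    static void synchronize_threads(bool nst_is_executing)
    {
        st_stop_execution();                 /* instruction execution stops    */
        st_flush_structures();               /* ST structures are emptied      */
        invalidate_l1sdc();                  /* speculative cache invalidated  */
        invalidate_st_register_file();       /* ST register file invalidated   */

        if (nst_is_executing) {
            /* Partial synchronization: the NST simply takes the role of the
             * ST, so no pipeline-refill penalty is paid. */
            swap_thread_roles();
        } else {
            /* Total synchronization: the pipeline is refilled from the
             * non-speculative state, paying the refill penalty. */
            restart_from_nonspeculative_state();
        }
    }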
Memory Subsystem • Maintains the memory state • speculative • non-speculative • A small additional first-level cache (L1SDC) is added to maintain the speculative memory state; the traditional memory subsystem (L1NSDC, L2NSDC) is kept for the non-speculative state • Rules (sketched below): 1. NST stores update values and allocate space in the non-speculative caches 2. ST stores update values in the L1SDC only 3. ST loads get values from the L1SDC; on a miss, from the non-speculative caches 4. NST loads get values from, and allocate space in, the non-speculative caches 5. Lines replaced in the L1NSDC are copied back to the L2NSDC
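A sketch of rules 1–4 as load/store handlers; the cache-access helpers are hypothetical, and rule 5 (L1NSDC replacements copied back to the L2NSDC) is assumed to happen inside the non-speculative hierarchy.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical hooks into the caches; nsc_* stands for the non-speculative
     * hierarchy (L1NSDC backed by L2NSDC, with copy-back on replacement). */
    bool l1sdc_lookup(uint64_t addr, uint64_t *val);   /* speculative L1 cache */
    void l1sdc_write(uint64_t addr, uint64_t val);
    uint64_t nsc_read(uint64_t addr);                  /* reads and allocates  */
    void     nsc_write(uint64_t addr, uint64_t val);   /* writes and allocates */

    /* Rule 2: ST stores update values in the L1SDC only. */
    void st_store(uint64_t addr, uint64_t val) { l1sdc_write(addr, val); }

    /* Rule 3: ST loads read the L1SDC first and fall back to the
     * non-speculative caches on a miss. */
    uint64_t st_load(uint64_t addr)
    {
        uint64_t val;
        if (l1sdc_lookup(addr, &val))
            return val;
        return nsc_read(addr);
    }

    /* Rule 1: NST stores update values and allocate space in the
     * non-speculative caches. */
    void nst_store(uint64_t addr, uint64_t val) { nsc_write(addr, val); }

    /* Rule 4: NST loads get values from, and allocate space in, the
     * non-speculative caches. */
    uint64_t nst_load(uint64_t addr) { return nsc_read(addr); }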
Register File • Slight modification to permit prompt execution • The register map table contains, for each entry: • Committed Value • ROB Tag • Counter • The counter field is maintained as follows (sketched below): • a new ST instruction increments the counter of its destination register • the counter is decremented when the ST instruction commits • after a trace speculation, counters are no longer incremented • but they keep being decremented until they reach zero
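A sketch of the counter protocol; the map-table layout and the event-handler names are assumptions for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_REGS 32

    typedef struct {
        uint64_t committed_value;            /* Committed Value                    */
        int      rob_tag;                    /* ROB Tag                            */
        int      counter;                    /* pending ST writers of the register */
    } regmap_entry_t;

    static regmap_entry_t map_table[NUM_REGS];
    static bool speculating;                 /* set once a trace speculation fires */

    /* A new ST instruction increments the counter of its destination register
     * (no longer done once a trace speculation has been fired). */
    void on_st_dispatch(int dest_reg)
    {
        if (!speculating)
            map_table[dest_reg].counter++;
    }

    /* The counter is decremented when the ST instruction commits; after a
     * speculation it keeps decreasing until it reaches zero, at which point
     * the committed value is stable. */
    void on_st_commit(int dest_reg, uint64_t value)
    {
        map_table[dest_reg].committed_value = value;
        if (map_table[dest_reg].counter > 0)
            map_table[dest_reg].counter--;
    }

    void on_trace_speculation(void) { speculating = true; }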
Working Example (timeline figure for ST, NST and VE, showing: ST begins execution; live-output actualization & trace speculation; NST begins execution and executes the speculated trace plus some additional instructions; VE begins verification, validates instructions and finishes verification; a second live-output actualization & trace speculation; executed vs. not-executed regions and live-output validation)
Experimental Framework • Simulator: Alpha version of the SimpleScalar toolset • Benchmarks: SPEC95 • Maximum optimization level: DEC C & F77 compilers with -non_shared -O5 • Statistics: collected for 125 million instructions, skipping initializations
Performance Evaluation • Main objective: show that trace misspeculations cause only minor penalties • Traces are built following a simple rule (sketched below) • from backward branch to backward branch • minimum and maximum sizes of 8 and 64 instructions, respectively • A simple trace predictor is evaluated • stride + context value predictor (history of 9) • Results provided • percentage of misspeculations • percentage of predicted instructions • speedup
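The trace-building rule can be sketched as a simple pass over the dynamic instruction stream; only the backward-branch delimiter and the 8/64 bounds come from the slide, the rest is illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define MIN_TRACE 8                      /* minimum trace size (from the slide) */
    #define MAX_TRACE 64                     /* maximum trace size (from the slide) */

    bool is_backward_branch(uint64_t pc);    /* hypothetical predicate */

    /* Grow the current trace one dynamic instruction at a time; close it at a
     * backward branch once the minimum size is reached, or when the maximum
     * size is hit. Returns true when the trace is complete. */
    bool extend_trace(uint64_t pc, int *trace_len)
    {
        (*trace_len)++;
        if (*trace_len >= MAX_TRACE)
            return true;
        return is_backward_branch(pc) && *trace_len >= MIN_TRACE;
    }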
Misspeculations (chart: percentage of trace misspeculations per benchmark, 0–100%)
Predicted Instructions (chart: percentage of predicted instructions per benchmark, 0–50%)
Speedup (chart: speedup per benchmark, 1.00–1.35)
Conclusions • TSMA • designed to exploit trace-level speculation • Special emphasis on • minimizing misspeculation penalties • Results show that: • the architecture is tolerant to misspeculations • a speedup of 16% is achieved with a predictor that misses 70% of the time
Future Work • Aggressive trace-level predictors • bigger traces • better value predictors • Generalization to multiple threads • cascade execution • Mixing prediction & execution • speculated traces do not need to be fully speculated