Computer Architecture Pipelines & Superscalars

Computer Architecture Pipelines & Superscalars

Pipelines • Data Hazards • Code: • lw $4, 0($1)add $15, $1, $1sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) The last four instructions all depend on a result produced by the first! MIPS instructions have the format op dest, srca, srcb

Pipelines - Data hazards • Examine the pipeline(ignore first 2!) • r2 onlyupdatedin timefor add!

Pipelines - Data Hazards • Compilersolution • InsertNOOPs • Inefficient!

Pipelines - Data Hazards • Second compiler solution • Reorder Read Written lw $4, 0($1)add $15, $1, $1sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) sub $2, $1, $3lw $4, 0($1)add $15, $1, $1 and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) These two must not define $1 or $3!

Pipelines - Data Hazards • Second compiler solution • Reorder Read Written sub $2, $1, $3lw $4, 0($1)add $15, $1, $1 and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) First use of $2

Pipelines - Data Hazards • Compiler analyses dependencies • Registerdefinitions • Registeruse • Read After Write(RAW)dependency • No dependencies • Instruction can be moved! Written sub $2, $1, $3lw $4, 0($1)add $15, $1, $1 and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) Uses of $2

Pipelines - Data Hazards • Hardware solution • Value forwarding • Hardware detectsdependency • scoreboard • Forwards resultfrom WB to EXfor subsequentuse • Hardware • Transparent to software!

Data Hazards - classification • Read after Write (RAW) • Instruction 1 must write before instruction 2 reads • Write after Write (WAW) • Instructions 1 and 2 both writeInstruction 2 must write after 1 • Write after Read (WAR) • Instruction 1 readsInstruction 2 writes (overwrites) • Instruction 2 must not write before 1 reads Reordering algorithms must consider all three!

Lecture 5 - Key Points • Data Hazards • RAW - most common • WAW • WAR • Compiler looks for dependencies • then re-orders • Hardware • Scoreboard • Monitors dependencies • ensures correct operation • Value forwarding hardware • Forwards results from EX stage

Pipelines - Exceptions • Caused by overflow, underflow • Example • add $1, $2, $1 • Overflow detected in EX stage • Causes jump to exception handler • as branch - remainder of pipeline flushed but • Compiler needs original $1 causing overflow • Register must not be overwritten • EX stage needs to squash WB operation • Precise Exception problem - more later!

Pipelines - Depth • Pipeline can’t be too deep • Hazards are frequent • many stalls in deep pipelines Too Deep! 2.5 2.0 Relative Performance 1.5 1.0 0.5 1 2 4 8 16 Pipeline Depth

Pipelines - Depth • Pipeline can’t be too deep • Hazards are frequent • many stalls in deep pipelines Too Deep! 2.5 2.0 Relative Performance Superpipelined 1.5 1.0 0.5 1 2 4 8 16 Pipeline Depth

CISC and pipelines • High Speed CISC processors are pipelined • Overlap IF, EX • Variable • instruction length • running time (number of microcode cycles) • pipeline imbalance • “backup” in pipe stages • complicate hazard detection • Complex addressing modes • auto-increment updates address register • multiple memory accesses required • smooth pipeline flow more difficult!

Instruction Queues • Vital performance determinant • Rate of instruction fetch • High Performance processors • Fetch multiple instructions in each cycle • 2 - 4 common • Use wide datapath to memory • PowerPC 604 128 bits = 4 instructions • Despatch unit • Examine dependencies • Determine which instructions can be despatched

Instruction Queues • Q “matches” fetch/despatch rates • General Strategy for matchingProducers - Consumers • Use of FIFO-style Queues • Absorb AsynchronousDelivery / ConsumptionRates • ProvidesElasticityin pipelines Producer Differing Instantaneous Rates FIFO Consumer

Superscalar Processors

Boundary of the Si die PowerPC organisation PowerPC 601 ~1993 • 3-way SuperScalar • Integer • Branch • Floating Point A newer machine will have more functional units here! New - Look in the “Example Processors” section of the Web notes

Superscalar Processors • Multiple Functional Units • PowerPC 604 • 6-way superscalar • Despatch Unit • Sends “ready” instructions to all free units • PowerPC 604: • potential 4 instructions/cycle (pipeline lengths are different!) • reality: 2-3 instructions/cycle?(program dependent!) Branch Unit LoadStore Unit 3 Integer Units Floating Point Unit

Superscalar Processors • Mix of functional units • Up to 8-way superscalar common now • 2 Floating point units • Usually have ~3 cycle latency • 3 Integer Arithmetic • Branch unit • Load / store unit • + ….? • Marketing departments can play some games with the ‘n’ of a n-way superscalar!

Superscalar – Maximum throughput • Instruction Issue Unit is the key! • If IIU only issues 4 instructions per cycle, • An n-way superscalar (n>> 4) can still only complete 4 instructions / cycle! • IIU has many tasks • Pre-fetch instructions • At least one cache line! • Check dependencies • Has data required by this instruction been computed yet? • Keeps register ‘scoreboard’ • Mark registers which will be written by instructions already issued • It’s a small dataflow machine (see later!) • Check availability of functional units

Computer Architecture Pipelines & Superscalars