250 likes | 370 Views
Computer Architecture Pipelines & Superscalars. Sunset over the Pacific Ocean Taken from Iolanthe II about 100nm north of Cape Reanga. Pipelines. Data Hazards Code: lw $4, 0($1) add $15, $1, $1 sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15,100($2)
E N D
Computer Architecture Pipelines & Superscalars Sunset over the Pacific Ocean Taken from Iolanthe II about 100nm north of Cape Reanga
Pipelines • Data Hazards • Code: • lw $4, 0($1)add $15, $1, $1sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) The last four instructions all depend on a result produced by the first! MIPS instructions have the format op dest, srca, srcb
Pipelines - Data hazards • Examine the pipeline(ignore first 2!) • r2 onlyupdatedin timefor add!
Pipelines - Data Hazards • Compilersolution • InsertNOOPs • Inefficient!
Pipelines - Data Hazards • Second compiler solution • Reorder Read Written lw $4, 0($1)add $15, $1, $1sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) sub $2, $1, $3lw $4, 0($1)add $15, $1, $1 and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) These two must not define $1 or $3!
Pipelines - Data Hazards • Second compiler solution • Reorder Read Written sub $2, $1, $3lw $4, 0($1)add $15, $1, $1 and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) First use of $2
Pipelines - Data Hazards • Compiler analyses dependencies • Registerdefinitions • Registeruse • Read After Write(RAW)dependency • No dependencies • Instruction can be moved! Written sub $2, $1, $3lw $4, 0($1)add $15, $1, $1 and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2) Uses of $2
Pipelines - Data Hazards • Hardware solution • Value forwarding • Hardware detectsdependency • scoreboard • Forwards resultfrom WB to EXfor subsequentuse • Hardware • Transparent to software!
Data Hazards - classification • Read after Write (RAW) • Instruction 1 must write before instruction 2 reads • Write after Write (WAW) • Instructions 1 and 2 both writeInstruction 2 must write after 1 • Write after Read (WAR) • Instruction 1 readsInstruction 2 writes (overwrites) • Instruction 2 must not write before 1 reads Reordering algorithms must consider all three!
Lecture 5 - Key Points • Data Hazards • RAW - most common • WAW • WAR • Compiler looks for dependencies • then re-orders • Hardware • Scoreboard • Monitors dependencies • ensures correct operation • Value forwarding hardware • Forwards results from EX stage
Pipelines - Exceptions • Caused by overflow, underflow • Example • add $1, $2, $1 • Overflow detected in EX stage • Causes jump to exception handler • as branch - remainder of pipeline flushed but • Compiler needs original $1 causing overflow • Register must not be overwritten • EX stage needs to squash WB operation • Precise Exception problem - more later!
Superpipelines • Time to complete each instruction = t • Total: Fetch + decode + fetch operands + operation + write-back • Clock frequency: f = 1/t • An n-stage pipeline allows n instructions ‘in flight’ simultaneously • Each pipeline stage does 1/n of the work • Each stage requires time t/n • Assumes a perfectly balanced pipeline! • Balanced = each stage requires the same time • Clock frequency: fpipe = 1/(t/n) = n/t • Increasing n increases processor power?
Pipelines - Depth • Pipeline can’t be too deep • Hazards are frequent • many stalls in deep pipelines Too Deep! 2.5 2.0 Relative Performance 1.5 1.0 0.5 1 2 4 8 16 Pipeline Depth
Pipelines - Depth • Pipeline can’t be too deep • Hazards are frequent • many stalls in deep pipelines Too Deep! 2.5 2.0 Relative Performance Superpipelined 1.5 1.0 0.5 1 2 4 8 16 Pipeline Depth
Pipeline depth • Increasing number of stages • Each stage adds overheads • Problems balancing pipeline • Require tpd1≈ tpd2≈ tpd3 • Stage time istpdj + tpdreg • n stages means n tpdreg overhead Operation (work) Operation (work) Operation (work) Register Register Register tpd1 tpdreg tpd2 tpdreg tpd3 tpdreg
CISC and pipelines • High Speed CISC processors are pipelined • Overlap IF, EX • Variable • instruction length • running time (number of microcode cycles) • pipeline imbalance • “backup” in pipe stages • complicate hazard detection • Complex addressing modes • auto-increment updates address register • multiple memory accesses required • smooth pipeline flow more difficult!
Instruction Queues • Vital performance determinant • Rate of instruction fetch • High Performance processors • Fetch multiple instructions in each cycle • 2 - 4 common • Use wide datapath to memory • PowerPC 604 128 bits = 4 instructions • Despatch unit • Examine dependencies • Determine which instructions can be despatched
Instruction Queues • Q “matches” fetch/despatch rates • General Strategy for matchingProducers - Consumers • Use of FIFO-style Queues • Absorb AsynchronousDelivery / ConsumptionRates • ProvidesElasticityin pipelines Producer Differing Instantaneous Rates FIFO Consumer
Boundary of the Si die PowerPC organisation PowerPC 601 ~1993 • 3-way SuperScalar • Integer • Branch • Floating Point A newer machine will have more functional units here! New - Look in the “Example Processors” section of the Web notes
Superscalar Processors • Multiple Functional Units • PowerPC 604 • 6-way superscalar • Despatch Unit • Sends “ready” instructions to all free units • PowerPC 604: • potential 4 instructions/cycle (pipeline lengths are different!) • reality: 2-3 instructions/cycle?(program dependent!) Branch Unit LoadStore Unit 3 Integer Units Floating Point Unit
Superscalar Processors • Mix of functional units • Up to 8-way superscalar common now • 2 Floating point units • Usually have ~3 cycle latency • 3 Integer Arithmetic • Branch unit • Load / store unit • + ….? • Marketing departments can play some games with the ‘n’ of a n-way superscalar!
Pentium Quad Core - 2008 • Distinguish between • Multiple ‘cores’ (separate processors) – later –and • Superscalars – multiple functional units per processor • “Wide dynamic execution” in Intel-speak • Quad core • 4 cores • Complete up to 4 instructions / cycle each • IIU can issue four instructions / cycle • 3 Mb L2 cache / processor (total 12Mb) • Master clock 3.2 GHz, front side bus 1.6GHz • 771 pins
Superscalar Limitations • To achieve maximum performance • Instruction mix must match Functional Unit mix • eg if we have 2 Integer ALUs, 2 FPUs, 1 branch unit, 1 load/store unit • Instruction issue unit (IIU) can issue 4 instructions • Each four instructions should be able to use 4 of the functional units • If instruction stream doesn’t have right mix • Some functional units will remain idle • FPUs require multiple cycles • Additional stalls • Pipeline hazards stall pipeline • 4-way superscalar gets 1.8-3 instructions completed per cycle • Program dependent!