CPU Performance
Elements of CPU performance • Cycle time • CPU pipeline • Memory system
Pipelining • Several instructions are executed simultaneously at different stages of completion • Various conditions can cause pipeline bubbles that reduce utilization: • branches • memory system delays • etc.
Pipeline structures • ARM7 has a 3-stage pipe: • fetch instruction from memory • decode opcode and operands • execute • ARM9 has a 5-stage pipe: • instruction fetch • decode • execute • data memory access • register write
ARM7 pipeline execution:

                 time:  1       2       3
  add r0,r1,#5          fetch   decode  execute
  sub r2,r3,r6                  fetch   decode
  cmp r2,#3                             fetch
Performance measures • Latency: time it takes for an instruction to get through the pipeline • Throughput: number of instructions executed per time period • Pipelining increases throughput without reducing latency
Pipeline stalls • If every stage cannot complete in the same amount of time, the pipeline stalls • Bubbles introduced by stalls increase latency and reduce throughput
ARM multi-cycle LDMIA instruction:

                     time:  1      2       3         4         5        6
  ldmia r0,{r2,r3}          fetch  decode  ex ld r2  ex ld r3
  sub r2,r3,r6                     fetch   decode    (stall)   ex sub
  cmp r2,#3                               fetch      (stall)   decode   ex cmp
Control stalls • Branches often introduce stalls (branch penalty) • Stall time may depend on whether branch is taken • May have to squash instructions that already started executing • Don’t know what to fetch until condition is evaluated
ARM pipelined branch (taken):

                     time:  1      2       3       4           5       6       7
  bne foo                   fetch  decode  ex bne  ex bne      ex bne
  sub r2,r3,r6                     fetch   decode  (squashed)
  foo add r0,r1,r2                                 fetch       decode  ex add
Example: ARM7 execution time • Determine execution time of FIR filter:

for (i=0; i<N; i++)
  f = f + c[i]*x[i];

;loop initiation code
  MOV r0,#0       ;use r0 for i, set to 0
  MOV r8,#0       ;use separate index for arrays
  ADR r2,N        ;get address for N
  LDR r1,[r2]     ;get value of N
  MOV r2,#0       ;use r2 for f, set to 0
  ADR r3,c        ;load r3 with the address of the base of the c array
  ADR r5,x        ;load r5 with the address of the base of the x array
;loop body
loop
  LDR r4,[r3,r8]  ;get value of c[i]
  LDR r6,[r5,r8]  ;get value of x[i]
  MUL r4,r4,r6    ;compute c[i]*x[i]
  ADD r2,r2,r4    ;add into running sum f
;update loop counter and array index
  ADD r8,r8,#4    ;add one word offset to array index
  ADD r0,r0,#1    ;add 1 to i
;test for exit
  CMP r0,r1
  BLT loop        ;if i<N, continue loop
loopend ....
ARM7 execution time (2) • Only the branch in the loop test may take more than one cycle: BLT loop takes 1 cycle best case (not taken), 3 cycles worst case (taken) • tloop = tinit + N(tbody + tupdate) + (N-1)ttest,worst + ttest,best • Techniques to reduce the branch penalty: • Delayed branch • Branch prediction • Branch folding
Delayed branch • To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch always execute, whether or not the branch is taken

With NOPs filling the delay slots:

;loop initiation code
  ........
;loop body
loop
  LDR r4,[r3,r8]  ;get value of c[i]
  LDR r6,[r5,r8]  ;get value of x[i]
  MUL r4,r4,r6
  ADD r2,r2,r4    ;add into running sum f
;update loop counter and array index
  ADD r8,r8,#4    ;add one word offset to array index
  ADD r0,r0,#1    ;add 1 to i
;test for exit
  CMP r0,r1
  BLT loop        ;if i<N, continue loop
  NOP
  NOP
loopend ....

With useful work moved into the delay slots:

;loop initiation code
  ........
;loop body
loop
  LDR r4,[r3,r8]  ;get value of c[i]
  LDR r6,[r5,r8]  ;get value of x[i]
  MUL r4,r4,r6
;update loop counter
  ADD r0,r0,#1    ;add 1 to i
;test for exit
  CMP r0,r1
  BLT loop        ;if i<N, continue loop
  ADD r2,r2,r4    ;add into running sum f
  ADD r8,r8,#4    ;add one word offset to array index
loopend ....
ARM10 processor execution time • It is impossible to briefly describe the exact behavior of all instructions in all circumstances, because execution time depends on: • Branch prediction • Prefetch buffer • Branch folding • The independent load/store unit • Data alignment • How many accesses hit in the cache and TLB
ARM10 integer core [block diagram: prefetch unit delivers 3 instructions per cycle to the integer unit]
Integer core • Prefetch unit • Fetches instructions from the I-cache or external memory • Predicts the outcome of branches whenever it can • Integer unit • Decode • Barrel shifter, ALU, multiplier • Main instruction sequencer • Load/store unit • Loads or stores two registers (64 bits) per cycle • Decouples from the integer unit after the first access of an LDM or STM instruction • Supports Hit-Under-Miss (HUM) operation
Pipeline • Fetch • I-cache access, branch prediction • Issue • Initial instruction decode • Decode • Final instruction decode, register read for ALU op, forwarding, and initial interlock resolution • Execute • Data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read • Memory • Data cache access • Write • Register writes, instruction retirement
Interlocks • The integer core uses forwarding to resolve most data dependencies between instructions • Pipeline interlocks • Data dependency interlocks: • instructions whose source register is loaded from memory by the previous instruction • Hardware dependency interlocks: • a new load waiting for the LSU to finish an existing LDM or STM • a load that misses when the HUM slot is already occupied • a new multiply waiting for a previous multiply to free up the first stage of the multiply array
Example of interlocking and forwarding • Execute-to-execute:
  mov r0, #1
  add r1, r0, #1
• Memory-to-execute:
  ldr r0, [r5]
  sub r1, r2, #2
  add r2, r0, #1
Example of interlocking and forwarding, cont'd • Single-cycle interlock: the str needs r0 from the preceding ldr

                  time:  1      2      3      4          5        6                 7
  ldr r0,[r1,r2]         fetch  issue  decode  ex r1+r2  mem r0   write
  str r3,[r0,r4]                fetch  issue   decode    (stall)  ex r0+r4, rd r3   mem wr r3
Superscalar execution • A superscalar processor can execute several instructions per cycle • Uses multiple pipelined data paths • Programs execute faster, but it is harder to determine how much faster
Data dependencies • Execution time depends on operands, not just opcode • A superscalar CPU checks data dependencies dynamically; e.g. in add r2,r0,r1 followed by add r3,r2,r5, the second add must wait for r2 [figure: data dependency graph]
Memory system performance • Caches introduce indeterminacy in execution time • Depends on order of execution • Cache miss penalty: added time due to a cache miss • Several reasons for a miss: compulsory, conflict, capacity