
CPU Performance



Presentation Transcript


  1. CPU Performance

  2. Elements of CPU performance • Cycle time • CPU pipeline • Memory system

  3. Pipelining • Several instructions are executed simultaneously at different stages of completion • Various conditions can cause pipeline bubbles that reduce utilization: • branches • memory system delays • etc.

  4. Pipeline structures • ARM7 has a 3-stage pipeline: • fetch instruction from memory • decode opcode and operands • execute • ARM9 has a 5-stage pipeline: • instruction fetch • decode • execute • data memory access • register write

  5. ARM9 core instruction pipeline

  6. ARM7 pipeline execution

         time:          1       2       3       4       5
         add r0,r1,#5   fetch   decode  execute
         sub r2,r3,r6           fetch   decode  execute
         cmp r2,#3                      fetch   decode  execute

  7. Performance measures • Latency: time it takes for an instruction to get through the pipeline • Throughput: number of instructions executed per time period • Pipelining increases throughput without reducing latency

  8. Pipeline stalls • If every stage cannot be completed in the same amount of time, the pipeline stalls • Bubbles introduced by a stall increase latency and reduce throughput

  9. ARM multi-cycle LDMIA instruction

         time:              1       2       3          4          5       6
         ldmia r0,{r2,r3}   fetch   decode  ex ld r2   ex ld r3
         sub r2,r3,r6               fetch   decode     (stall)    ex sub
         cmp r2,#3                          fetch      (stall)    decode  ex cmp

  10. Control stalls • Branches often introduce stalls (branch penalty) • Stall time may depend on whether branch is taken • May have to squash instructions that already started executing • Don’t know what to fetch until condition is evaluated

  11. ARM pipelined branch

         time:               1       2       3       4       5       6       7
         bne foo             fetch   decode  ex bne  ex bne  ex bne
         sub r2,r3,r6                fetch   decode  (squashed)
         foo add r0,r1,r2                                    fetch   decode  ex add

  12. Example: ARM7 execution time • Determine the execution time of the FIR filter:

      for (i = 0; i < N; i++)
          f = f + c[i]*x[i];

      ;loop initiation code
              MOV r0,#0       ;use r0 for i, set to 0
              MOV r8,#0       ;use separate index for arrays
              ADR r2,N        ;get address for N
              LDR r1,[r2]     ;get value of N
              MOV r2,#0       ;use r2 for f, set to 0
              ADR r3,c        ;load r3 with the address of the base of the c array
              ADR r5,x        ;load r5 with the address of the base of the x array
      ;loop body
      loop    LDR r4,[r3,r8]  ;get value of c[i]
              LDR r6,[r5,r8]  ;get value of x[i]
              MUL r4,r4,r6
              ADD r2,r2,r4    ;add into running sum f
      ;update loop counter and array index
              ADD r8,r8,#4    ;add one word offset to array index
              ADD r0,r0,#1    ;add 1 to i
      ;test for exit
              CMP r0,r1
              BLT loop        ;if i<N, continue loop
      loopend ....

  13. ARM7 execution time (2) • Only the branch in the loop test may take more than one cycle • BLT loop takes 1 cycle best case, 3 worst case • t_loop = t_init + N(t_body + t_update) + (N-1)t_test,worst + t_test,best • Techniques for reducing the branch penalty: • delayed branch • branch prediction • branch folding

  14. Delayed branch • To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch always be executed, whether or not the branch is taken

      ;with NOPs filling the delay slots
      ;loop initiation code
      ........
      ;loop body
      loop    LDR r4,[r3,r8]  ;get value of c[i]
              LDR r6,[r5,r8]  ;get value of x[i]
              MUL r4,r4,r6
              ADD r2,r2,r4    ;add into running sum f
      ;update loop counter and array index
              ADD r8,r8,#4    ;add one word offset to array index
              ADD r0,r0,#1    ;add 1 to i
      ;test for exit
              CMP r0,r1
              BLT loop        ;if i<N, continue loop
              NOP
              NOP
      loopend ....

      ;rescheduled: useful work moved into the delay slots
      ;loop initiation code
      ........
      ;loop body
      loop    LDR r4,[r3,r8]  ;get value of c[i]
              LDR r6,[r5,r8]  ;get value of x[i]
              MUL r4,r4,r6
      ;update loop counter and array index
              ADD r0,r0,#1    ;add 1 to i
      ;test for exit
              CMP r0,r1
              BLT loop        ;if i<N, continue loop
              ADD r2,r2,r4    ;add into running sum f
              ADD r8,r8,#4    ;add one word offset to array index
      loopend ....

  15. ARM10 processor execution time • It is impossible to describe briefly the exact behavior of all instructions in all circumstances; execution time depends on: • branch prediction • the prefetch buffer • branch folding • the independent Load/Store Unit • data alignment • how many accesses hit in the cache and TLB

  16. ARM10 integer core (diagram)

  17. Branch Folding

  18. Branch Folding (2)

  19. Integer core • Prefetch Unit • fetches instructions from the I-cache or external memory • predicts the outcome of branches whenever it can • Integer Unit • decode • barrel shifter, ALU, multiplier • main instruction sequencer • Load/Store Unit • loads or stores two registers (64 bits) per cycle • decouples from the integer unit after the first access of an LDM or STM instruction • supports Hit-Under-Miss (HUM) operation

  20. Pipeline • Fetch • I-cache access, branch prediction • Issue • Initial instruction decode • Decode • Final instruction decode, register read for ALU op, forwarding, and initial interlock resolution • Execute • Data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read • Memory • Data cache access • Write • Register writes, instruction retirement

  21. Typical operations

  22. Load or store operation

  23. LDR operation that misses

  24. Interlocks • The integer core uses forwarding to resolve data dependencies between instructions • Pipeline interlocks • Data dependency interlocks: • instructions whose source register is loaded from memory by the previous instruction • Hardware dependency interlocks: • a new load waiting for the LSU to finish an existing LDM or STM • a load that misses when the HUM slot is already occupied • a new multiply waiting for a previous multiply to free up the first stage of the multiplier

  25. Pipeline forwarding paths

  26. Example of interlocking and forwarding
      • Execute-to-execute forwarding:
          mov r0, #1
          add r1, r0, #1
      • Memory-to-execute forwarding:
          ldr r0, [r5]
          sub r1, r2, #2
          add r2, r0, #1

  27. Example of interlocking and forwarding, cont’d
      • Single-cycle interlock:
          ldr r0, [r1, r2]
          str r3, [r0, r4]

          time:   1       2       3       4            5              6                    7
          ldr     fetch   issue   decode  ex: r1+r2    mem: r0 read   write
          str             fetch   issue   decode       (stall)        ex: r0+r4, read r3   mem: write r3

  28. Superscalar execution • A superscalar processor can execute several instructions per cycle • Uses multiple pipelined data paths • Programs execute faster, but it is harder to determine how much faster

  29. Data dependencies • Execution time depends on operands, not just the opcode • A superscalar CPU checks data dependencies dynamically:
          add r2,r0,r1    ;reads r0, r1; writes r2
          add r3,r2,r5    ;reads r2, r5 (data dependency on the first add for r2)

  30. Memory system performance • Caches introduce indeterminacy in execution time • Depends on order of execution • Cache miss penalty: added time due to a cache miss • Several reasons for a miss: compulsory, conflict, capacity
