
EENG 449bG/CPSC 439bG Computer Systems Lecture 13 ARM Performance Issues and Programming

EENG 449bG/CPSC 439bG Computer Systems Lecture 13 ARM Performance Issues and Programming. February 24, 2005 Prof. Andreas Savvides Spring 2005 http://www.eng.yale.edu/courses/eeng449bG. ARM Thumb Benchmark Performance. Dhrystone benchmark result Memory System Performance. ARM vs. THUMB.


Presentation Transcript


  1. EENG 449bG/CPSC 439bG Computer Systems, Lecture 13: ARM Performance Issues and Programming. February 24, 2005. Prof. Andreas Savvides, Spring 2005. http://www.eng.yale.edu/courses/eeng449bG

  2. ARM Thumb Benchmark Performance • Dhrystone benchmark result • Memory System Performance

  3. ARM vs. THUMB • “THUMB code will provide up to 65% of the code size and 160% of [the performance of] an equivalent ARM connected to a 16-bit memory system”. • Advantage of ARM over THUMB • Able to manipulate 32-bit integers in a single instruction • THUMB’s advantage over 32-bit architectures with 16-bit instructions • It can switch back and forth between 16-bit and 32-bit instructions • Fast interrupts & DSP algorithms can be implemented in 32-bit code, and the processor can switch back and forth

  4. Not the case when you have loads and stores!!!!

  5. Optimizing Code Execution in Hardware • ARM7 uses a three-stage pipeline • Each instruction takes 3 cycles to pass through the pipeline, but the pipeline sustains a CPI of 1 • Two possible ways to increase performance • Increase CPU frequency • Reduce CPI (increase pipeline stages & optimizations)
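
The relationship between pipeline depth, CPI, and clock frequency on this slide can be sketched in C. This is a minimal sketch assuming an ideal pipeline with no stalls; the function names `pipeline_cycles` and `exec_time_us` are illustrative and do not come from the lecture.

```c
#include <assert.h>

/* Cycles needed for n instructions on an ideal pipeline with no stalls:
   the first instruction needs `stages` cycles, and every later one
   completes one cycle after its predecessor. */
long pipeline_cycles(long n, int stages) {
    assert(n > 0 && stages > 0);
    return stages + (n - 1);
}

/* Execution time in microseconds = instruction count * CPI / clock (MHz). */
double exec_time_us(long instr_count, double cpi, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return (double)instr_count * cpi / clock_mhz;
}
```

For 1000 instructions on a three-stage pipeline, `pipeline_cycles(1000, 3)` gives 1002 cycles, so the effective CPI approaches 1 even though each instruction spends 3 cycles in flight. Doubling the clock or halving the CPI halves the result of `exec_time_us`, which corresponds to the two options listed on the slide.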

  6. Where are the other bottlenecks? • Exceptions • Stalls due to Memory Accesses • Inefficient Software

  7. Exceptions & Memory Performance • Refer to handout from last class for exceptions discussion • Lec 11 of handout for exceptions • Lec 10 of handout for memory

  8. Exception Priorities & Latencies • Exception Priorities 1. Reset 2. Data Abort 3. FIQ 4. IRQ 5. Prefetch Abort 6. Software Interrupt • Interrupt Latencies • FIQ worst case: time to pass through synchronizer + time for longest instruction to complete + time for data abort entry + time for FIQ entry = 1.4 us on a 20 MHz processor • IRQ worst case: same as FIQ, but if an FIQ occurs right before, then you need 2x the FIQ latency
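
The 1.4 us figure corresponds to 28 processor cycles at 20 MHz (1.4 us x 20 MHz = 28). The sketch below only does that unit conversion; the slide does not give the per-term cycle breakdown, so none is assumed here, and the function names are illustrative.

```c
#include <assert.h>

/* Convert a latency in processor cycles to microseconds at a clock given in MHz. */
double cycles_to_us(unsigned cycles, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return cycles / clock_mhz;
}

/* Worst-case IRQ latency as stated on the slide: an FIQ arriving just
   before the IRQ doubles the FIQ worst-case figure. */
double irq_worst_case_us(unsigned fiq_cycles, double clock_mhz) {
    return 2.0 * cycles_to_us(fiq_cycles, clock_mhz);
}
```

With these helpers, `cycles_to_us(28, 20.0)` reproduces the 1.4 us FIQ figure and `irq_worst_case_us(28, 20.0)` gives 2.8 us. The same cycle count at a faster clock gives a proportionally shorter latency.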

  9. Interrupts • FIQ – does not require saving context. • ARM has sufficient state to bypass this • When leaving the interrupt handler, program should execute • SUBS PC, R14_fiq, #4 • IRQ – lower priority, masked out when FIQ is entered • When leaving the handler, program should execute • SUBS PC, R14_irq, #4 • Software Interrupt: SWI • Returning handler should execute • MOVS PC, R14_svc

  10. Exceptions

  11. Instruction Latencies

  12. Bus Cycle Types • Nonsequential • Requests a transfer to or from an address which is unrelated to the address used in the preceding cycle • Sequential • Requests a transfer to or from an address which is either the same as, or one word or one halfword greater than, the address used in the preceding cycle • Internal • Does not require a transfer because it is performing an internal function
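
The three definitions above can be written as a small classifier. This is a sketch assuming 4-byte words and 2-byte halfwords; the `BusCycle` enum and `classify_cycle` function are illustrative names, not part of any ARM interface.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CYCLE_NONSEQUENTIAL, CYCLE_SEQUENTIAL, CYCLE_INTERNAL } BusCycle;

/* Classify a bus cycle from the previous and current transfer addresses.
   needs_transfer == 0 models an internal cycle (no memory transfer). */
BusCycle classify_cycle(uint32_t prev_addr, uint32_t addr, int needs_transfer) {
    if (!needs_transfer)
        return CYCLE_INTERNAL;
    uint32_t delta = addr - prev_addr;          /* wraps for lower addresses */
    if (delta == 0 || delta == 2 || delta == 4) /* same, +halfword, +word */
        return CYCLE_SEQUENTIAL;
    return CYCLE_NONSEQUENTIAL;
}
```

A run of word loads from consecutive addresses starts with one nonsequential cycle followed by sequential cycles, which is why multiple load and store instructions make good use of the bus.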

  13. Optimizing Code In Software • Try to perform similar optimizations as the compiler does • Loop unrolling • Eliminate redundant code • Optimize memory accesses • Use multiple load and store instructions • Avoid using 32-bit data types, since they reduce your load and store performance.

  14. Note that we are switching to MIPS architecture to discuss software optimizations…

  15. Running Example • This code adds a scalar to a vector:
        for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
      • Assume the following latencies for all examples:
        Instruction producing result   Instruction using result   Execution in cycles   Latency in cycles
        FP ALU op                      Another FP ALU op          4                     3
        FP ALU op                      Store double               3                     2
        Load double                    FP ALU op                  1                     1
        Load double                    Store double               1                     0
        Integer op                     Integer op                 1                     0
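
As a reference point for the assembly versions that follow, the running example can be wrapped in a C function. This is a minimal sketch: the function name `add_scalar` and the parameter `n` (1000 on the slide) are introduced here for illustration.

```c
#include <assert.h>

/* Adds scalar s to elements x[1]..x[n], counting down as on the slide.
   x must have at least n+1 elements: index n is touched, index 0 is not. */
void add_scalar(double *x, int n, double s) {
    assert(x != 0 && n >= 0);
    for (int i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}
```

Each iteration is independent of the others, which is the property the later unrolling and scheduling slides rely on.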

  16. FP Loop: Where are the Hazards? • First translate into MIPS code: • To simplify, assume 8 is the lowest address
        Loop: L.D    F0,0(R1)    ;F0=vector element
              ADD.D  F4,F0,F2    ;add scalar from F2
              S.D    F4,0(R1)    ;store result
              DADDUI R1,R1,#-8   ;decrement pointer 8B (DW)
              BNEZ   R1,Loop     ;branch R1!=zero
              NOP                ;delayed branch slot
      Where are the stalls?

  17. FP Loop Showing Stalls
         1 Loop: L.D    F0,0(R1)    ;F0=vector element
         2       stall
         3       ADD.D  F4,F0,F2    ;add scalar in F2
         4       stall
         5       stall
         6       S.D    F4,0(R1)    ;store result
         7       DADDUI R1,R1,#-8   ;decrement pointer 8B (DW)
         8       stall
         9       BNEZ   R1,Loop     ;branch R1!=zero
        10       stall              ;delayed branch slot
      • 10 clocks: Rewrite code to minimize stalls?
        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1

  18. Revised FP Loop Minimizing Stalls
        1 Loop: L.D    F0,0(R1)
        2       DADDUI R1,R1,#-8
        3       ADD.D  F4,F0,F2
        4       stall
        5       BNE    R1,R2,Loop   ;delayed branch
        6       S.D    F4,8(R1)
      6 clocks, but just 3 for execution, 3 for loop overhead; how can we make it faster? Swap BNE and S.D by changing address of S.D
        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
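
The 10-clock and 6-clock totals can be reproduced with a small in-order issue model. This is a sketch under stated assumptions: one instruction issues per cycle, an instruction waits until its source registers are ready according to the latency table, and a one-cycle integer-to-branch latency is assumed from the stall shown before the branch. The types and names (`Instr`, `count_cycles`, `original_loop`, `scheduled_loop`) are illustrative, and only a single iteration is modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FP_ALU, LOAD, STORE, INT_OP, BRANCH, NOP } Kind;

typedef struct {
    Kind kind;
    int dest;    /* register written, or -1 for none */
    int src[2];  /* registers read, or -1 for none */
} Instr;

#define F(n) (n)        /* floating-point register Fn */
#define R(n) (32 + (n)) /* integer register Rn */
#define NREGS 64

/* Stall cycles required between a producer and a dependent consumer. */
static int latency(Kind producer, Kind consumer) {
    if (producer == FP_ALU && consumer == FP_ALU) return 3;
    if (producer == FP_ALU && consumer == STORE)  return 2;
    if (producer == LOAD   && consumer == FP_ALU) return 1;
    if (producer == INT_OP && consumer == BRANCH) return 1; /* assumption */
    return 0;
}

/* Issue instructions in order, delaying each until its source operands
   are available. Returns the cycle in which the last one issues. */
int count_cycles(const Instr *code, size_t n) {
    int written_at[NREGS] = {0};        /* 0 = not written in this pass */
    Kind written_by[NREGS] = {FP_ALU};
    int cycle = 0;
    for (size_t i = 0; i < n; i++) {
        int issue = cycle + 1;
        for (int k = 0; k < 2; k++) {
            int r = code[i].src[k];
            if (r < 0 || written_at[r] == 0) continue;
            int ready = written_at[r] + 1 + latency(written_by[r], code[i].kind);
            if (ready > issue) issue = ready;
        }
        cycle = issue;
        if (code[i].dest >= 0) {
            written_at[code[i].dest] = cycle;
            written_by[code[i].dest] = code[i].kind;
        }
    }
    return cycle;
}

/* One iteration as first written: L.D, ADD.D, S.D, DADDUI, BNEZ, NOP. */
const Instr original_loop[] = {
    { LOAD,   F(0), { R(1), -1 } },
    { FP_ALU, F(4), { F(0), F(2) } },
    { STORE,  -1,   { F(4), R(1) } },
    { INT_OP, R(1), { R(1), -1 } },
    { BRANCH, -1,   { R(1), -1 } },
    { NOP,    -1,   { -1, -1 } },
};

/* Rescheduled: DADDUI moved up, S.D moved into the branch delay slot. */
const Instr scheduled_loop[] = {
    { LOAD,   F(0), { R(1), -1 } },
    { INT_OP, R(1), { R(1), -1 } },
    { FP_ALU, F(4), { F(0), F(2) } },
    { BRANCH, -1,   { R(1), R(2) } },
    { STORE,  -1,   { F(4), R(1) } },
};
```

Calling `count_cycles(original_loop, 6)` and `count_cycles(scheduled_loop, 5)` gives the two totals. The model may place a stall in a different slot than the hand-drawn schedule (for example after the branch rather than before it), but the total cycle count is the same.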

  19. Unroll Loop Four Times (straightforward way)
         1 Loop: L.D    F0,0(R1)
         2       ADD.D  F4,F0,F2
         3       S.D    F4,0(R1)      ;drop DADDUI & BNEZ
         4       L.D    F6,-8(R1)
         5       ADD.D  F8,F6,F2
         6       S.D    F8,-8(R1)     ;drop DADDUI & BNEZ
         7       L.D    F10,-16(R1)
         8       ADD.D  F12,F10,F2
         9       S.D    F12,-16(R1)   ;drop DADDUI & BNEZ
        10       L.D    F14,-24(R1)
        11       ADD.D  F16,F14,F2
        12       S.D    F16,-24(R1)
        13       DADDUI R1,R1,#-32    ;alter to 4*8
        14       BNEZ   R1,Loop
      Stalls: 1 cycle after each L.D, 2 cycles after each ADD.D, 1 cycle after DADDUI, 1 cycle for the delayed branch slot.
      14 + (4 x (1+2)) + 2 = 28 clock cycles, or 7 per iteration. Rewrite loop to minimize stalls?

  20. Unrolled Loop Detail • Do not usually know upper bound of loop • Suppose it is n, and we would like to unroll the loop to make k copies of the body • Instead of a single unrolled loop, we generate a pair of consecutive loops: • 1st executes (n mod k) times and has a body that is the original loop • 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times • For large values of n, most of the execution time will be spent in the unrolled loop • Problem: Although it improves execution performance, it increases the code size substantially!
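
The pair of consecutive loops described above can be sketched in C for k = 4. This is an illustrative version of the running example; the function name `add_scalar_unrolled` and the parameter `n` are introduced here and are not from the lecture.

```c
#include <assert.h>

/* Adds s to x[1]..x[n] using an unroll factor of k = 4.
   The first loop runs (n mod 4) times with the original body; the second
   runs n/4 times with four copies of the body. x needs n+1 elements. */
void add_scalar_unrolled(double *x, int n, double s) {
    assert(x != 0 && n >= 0);
    int i = n;
    for (int r = n % 4; r > 0; r--) {   /* 1st loop: original body */
        x[i] = x[i] + s;
        i = i - 1;
    }
    for (; i > 0; i = i - 4) {          /* 2nd loop: i is now a multiple of 4 */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

For n = 10 the first loop handles elements 10 and 9, and the unrolled loop handles 8 down to 1 in two passes. The code-size cost noted on the slide is visible here: the loop body now appears five times in the source.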

  21. Unrolled Loop That Minimizes Stalls (scheduled based on the latencies from slide 15)
         1 Loop: L.D    F0,0(R1)
         2       L.D    F6,-8(R1)
         3       L.D    F10,-16(R1)
         4       L.D    F14,-24(R1)
         5       ADD.D  F4,F0,F2
         6       ADD.D  F8,F6,F2
         7       ADD.D  F12,F10,F2
         8       ADD.D  F16,F14,F2
         9       S.D    F4,0(R1)
        10       S.D    F8,-8(R1)
        11       S.D    F12,-16(R1)
        12       DADDUI R1,R1,#-32
        13       BNEZ   R1,Loop
        14       S.D    F16,8(R1)     ; 8-32 = -24
      14 clock cycles, or 3.5 per iteration. Better than 7 before scheduling and 6 when scheduled and not unrolled
      • What assumptions were made when the code was moved? • OK to move store past DADDUI even though it changes the register • OK to move loads before stores: get right data? • When is it safe for the compiler to make such changes?

  22. Compiler Perspectives on Code Movement • Compiler concerned about dependencies in program • Whether or not a HW hazard depends on pipeline • Try to schedule to avoid hazards that cause performance losses • (True) Data dependencies (RAW if a hazard for HW) • Instruction i produces a result used by instruction j, or • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. • If dependent, can’t execute in parallel • Easy to determine for registers (fixed names) • Hard for memory (“memory disambiguation” problem): • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)?

  23. Compiler Perspectives on Code Movement • Name dependencies are hard to discover for memory accesses • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)? • Our example required the compiler to know that if R1 doesn’t change then: 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1) • There were no dependencies between some loads and stores, so they could be moved past each other
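
The disambiguation question can be phrased as an overlap test on base-plus-offset addresses. This is a sketch that assumes the register values are known, which is exactly what a compiler usually lacks at compile time; the function name `accesses_overlap` and the numeric register values in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Do two accesses of `size` bytes at base_a+off_a and base_b+off_b overlap? */
int accesses_overlap(int64_t base_a, int64_t off_a,
                     int64_t base_b, int64_t off_b, int64_t size) {
    int64_t a = base_a + off_a;
    int64_t b = base_b + off_b;
    return a < b + size && b < a + size;
}
```

With the same base register, the offsets 0, -8, -16 and -24 never overlap for 8-byte accesses, so the answer is known without knowing R1. For 100(R4) and 20(R6) the answer depends on the register contents: with R4 = 1000 and R6 = 1080 both refer to address 1100, while with R6 = 2000 they do not.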

  24. Steps Compiler Performed to Unroll • Check OK to move the S.D after DADDUI and BNEZ, and find amount to adjust S.D offset • Determine unrolling the loop would be useful by finding that the loop iterations were independent • Rename registers to avoid name dependencies • Eliminate extra test and branch instructions and adjust the loop termination and iteration code • Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent • requires analyzing memory addresses and finding that they do not refer to the same address. • Schedule the code, preserving any dependences needed to yield same result as the original code

  25. Where are the name dependencies?
         1 Loop: L.D    F0,0(R1)
         2       ADD.D  F4,F0,F2
         3       S.D    F4,0(R1)      ;drop DADDUI & BNEZ
         4       L.D    F0,-8(R1)
         5       ADD.D  F4,F0,F2
         6       S.D    F4,-8(R1)     ;drop DADDUI & BNEZ
         7       L.D    F0,-16(R1)
         8       ADD.D  F4,F0,F2
         9       S.D    F4,-16(R1)    ;drop DADDUI & BNEZ
        10       L.D    F0,-24(R1)
        11       ADD.D  F4,F0,F2
        12       S.D    F4,-24(R1)
        13       DADDUI R1,R1,#-32    ;alter to 4*8
        14       BNEZ   R1,Loop
        15       NOP
      How can we remove them? (See pg. 310 of text)

  26. Where are the name dependencies?
         1 Loop: L.D    F0,0(R1)
         2       ADD.D  F4,F0,F2
         3       S.D    F4,0(R1)      ;drop DSUBUI & BNEZ
         4       L.D    F6,-8(R1)
         5       ADD.D  F8,F6,F2
         6       S.D    F8,-8(R1)     ;drop DSUBUI & BNEZ
         7       L.D    F10,-16(R1)
         8       ADD.D  F12,F10,F2
         9       S.D    F12,-16(R1)   ;drop DSUBUI & BNEZ
        10       L.D    F14,-24(R1)
        11       ADD.D  F16,F14,F2
        12       S.D    F16,-24(R1)
        13       DSUBUI R1,R1,#32     ;alter to 4*8
        14       BNEZ   R1,Loop
        15       NOP
      The original “register renaming”: instruction execution can be overlapped or in parallel

  27. Limits to Loop Unrolling • Decrease in the amount of loop overhead amortized with each unroll – After a few unrolls the loop overhead amortization is very small • Code size limitations – memory is not infinite, especially in embedded systems • Compiler limitations – shortfall in registers due to excessive unrolling – register pressure – optimized code may lose its advantage due to the lack of registers
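
The diminishing return from each additional unroll can be made concrete with a little arithmetic. This sketch assumes the loop overhead is the two instructions from the running example (the pointer update and the branch) and ignores stalls; the function names are illustrative.

```c
#include <assert.h>

/* Loop-overhead instructions (pointer update + branch) charged to each
   element when the body is unrolled k times. */
double overhead_per_element(int k) {
    assert(k >= 1);
    return 2.0 / k;
}

/* Overhead saved per element by going from unroll factor k to 2k. */
double gain_from_doubling(int k) {
    return overhead_per_element(k) - overhead_per_element(2 * k);
}
```

Going from 1 to 2 copies saves one overhead instruction per element, from 4 to 8 saves a quarter, and from 8 to 16 only an eighth, while the code size and register demand keep doubling.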

  28. Static Branch Prediction • Simplest: Predict taken • Average misprediction rate = untaken branch frequency, which for the SPEC programs is 34%. • Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%) • Predict on the basis of branch direction? • Choosing backward-going branches to be taken (loop) • Forward-going branches to be not taken (if) • In the SPEC programs, however, most forward-going branches are taken => predict taken is better • Predict branches on the basis of profile information collected from earlier runs • Misprediction varies from 5% to 22%
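
The first two static schemes can be compared on a recorded branch trace. This is a sketch: the `BranchEvent` record and both function names are hypothetical, and the trace in the usage note is made up for illustration, not SPEC data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;      /* address of the branch */
    uint32_t target;  /* branch target address */
    int taken;        /* actual outcome: 1 = taken, 0 = not taken */
} BranchEvent;

/* Direction-based rule: backward branches predicted taken, forward not taken. */
int predict_by_direction(uint32_t pc, uint32_t target) {
    return target <= pc;
}

/* Fraction of events mispredicted. always_taken != 0 selects the
   "predict taken" rule, otherwise the direction-based rule is used. */
double misprediction_rate(const BranchEvent *ev, size_t n, int always_taken) {
    assert(n > 0);
    size_t wrong = 0;
    for (size_t i = 0; i < n; i++) {
        int guess = always_taken ? 1
                                 : predict_by_direction(ev[i].pc, ev[i].target);
        if (guess != ev[i].taken)
            wrong++;
    }
    return (double)wrong / (double)n;
}
```

On a trace with a backward loop branch taken twice and then not taken, plus one taken forward branch, predict-taken misses once (25%) while the direction rule misses twice (50%). That is consistent with the slide's observation that taken forward branches favor predict-taken.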

  29. Next Time • Quiz – no regular lecture • March 3 – hardware optimizations and HW ILP
