OPTIMIZING C CODE FOR THE ARM PROCESSOR

Presentation Transcript


  1. OPTIMIZING C CODE FOR THE ARM PROCESSOR • Optimizing code takes time and reduces source code readability • Usually done for functions that are critical for performance or power consumption and are executed frequently • Usually in combination with profiling

  2. LOCAL VARIABLES • ARM registers are 32-bit, so 32-bit data types are the most efficient • Use int and unsigned int for local variables and avoid char and short • The only exception is when you want 8-bit or 16-bit wraparound to occur • unsigned int is more efficient for division
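
As a rough illustration of the point about local variable types, the two functions below compute the same checksum. The names `sum_char_index` and `sum_int_index` are hypothetical and not from the slides. With an 8-bit counter the compiler generally has to keep `i` within the range of `unsigned char`, which may cost an extra instruction per iteration on ARM; the 32-bit counter fits a register exactly.

```c
/* Hypothetical example: loop counter declared as an 8-bit type.
 * The compiler may need extra work each iteration to keep i
 * within the 0..255 range of unsigned char. */
int sum_char_index(const int *data)
{
    unsigned char i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}

/* Same loop with a 32-bit counter that matches the register width. */
int sum_int_index(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```

Both functions return the same value; only the generated code is expected to differ, and how much depends on the compiler and its optimization level.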

  3. LOOP STRUCTURES (incrementing for loop)

     checksum_v5
             MOV     r2,r0               ; r2 = data
             MOV     r0,#0               ; sum = 0
             MOV     r1,#0               ; i = 0
     checksum_v5_loop
             LDR     r3,[r2],#4          ; r3 = *(data++)
             ADD     r1,r1,#1            ; i++
             CMP     r1,#0x40            ; compare i, 64
             ADD     r0,r3,r0            ; sum += r3
             BCC     checksum_v5_loop    ; if (i<64) goto loop
             MOV     pc,r14              ; return sum

     int checksum_v5(int *data)
     {
         unsigned int i;
         int sum = 0;

         for (i = 0; i < 64; i++) {
             sum += *(data++);
         }
         return sum;
     }

  4. LOOP STRUCTURES (decrementing for loop)

     checksum_v6
             MOV     r2,r0               ; r2 = data
             MOV     r0,#0               ; sum = 0
             MOV     r1,#0x40            ; i = 64
     checksum_v6_loop
             LDR     r3,[r2],#4          ; r3 = *(data++)
             SUBS    r1,r1,#1            ; i-- and set flags
             ADD     r0,r3,r0            ; sum += r3
             BNE     checksum_v6_loop    ; if (i!=0) goto loop
             MOV     pc,r14              ; return sum

     int checksum_v6(int *data)
     {
         unsigned int i;
         int sum = 0;

         for (i = 64; i != 0; i--) {
             sum += *(data++);
         }
         return sum;
     }

  5. LOOP UNROLLING

     checksum_v7                         ; r0 = data, r1 = N
             MOV     r2,#0               ; sum = 0
     checksum_v7_loop
             LDR     r3,[r0],#4          ; r3 = *(data++)
             SUBS    r1,r1,#4            ; N -= 4 and set flags
             ADD     r2,r3,r2            ; sum += r3
             LDR     r3,[r0],#4          ; r3 = *(data++)
             ADD     r2,r3,r2            ; sum += r3
             LDR     r3,[r0],#4          ; r3 = *(data++)
             ADD     r2,r3,r2            ; sum += r3
             LDR     r3,[r0],#4          ; r3 = *(data++)
             ADD     r2,r3,r2            ; sum += r3
             BNE     checksum_v7_loop    ; if (N!=0) goto loop
             MOV     r0,r2               ; r0 = sum
             MOV     pc,r14              ; return r0

     int checksum_v7(int *data, unsigned int N)
     {
         int sum = 0;

         do {
             sum += *(data++);
             sum += *(data++);
             sum += *(data++);
             sum += *(data++);
             N -= 4;
         } while (N != 0);
         return sum;
     }
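
The unrolled `checksum_v7` on this slide only terminates correctly when `N` is a nonzero multiple of four. A minimal sketch of one way to lift that restriction is shown below; `checksum_any` is a hypothetical name, not from the slides. It runs the unrolled body `N / 4` times and then handles the 0 to 3 leftover elements in a short cleanup loop.

```c
/* Sketch: unroll by four, then handle the leftover elements.
 * checksum_any is a hypothetical name, not from the slides. */
int checksum_any(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int i;

    /* Unrolled body: runs N / 4 times, counting down to zero. */
    for (i = N / 4; i != 0; i--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    /* Cleanup: the remaining N mod 4 elements. */
    for (i = N & 3; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```

The cleanup loop costs a little code size, so it is only worth adding when the element count is not known to be a multiple of the unroll factor.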

  6. Loop Unrolling example • Unroll the following loop by a factor of 2, 4, and 8

     for (i = 0; i < 64; i++) {
         a[i] = b[i] + c[i+1];
     }

  7. Factor of 2

     for (i = 0; i < 32; i++) {
         a[2*i]   = b[2*i]   + c[2*i+1];
         a[2*i+1] = b[2*i+1] + c[2*i+1+1];
     }

  8. Factor of 4

     for (i = 0; i < 16; i++) {
         a[4*i]   = b[4*i]   + c[4*i+1];
         a[4*i+1] = b[4*i+1] + c[4*i+1+1];
         a[4*i+2] = b[4*i+2] + c[4*i+2+1];
         a[4*i+3] = b[4*i+3] + c[4*i+3+1];
     }

  9. Factor of 8

     for (i = 0; i < 8; i++) {
         a[8*i]   = b[8*i]   + c[8*i+1];
         a[8*i+1] = b[8*i+1] + c[8*i+1+1];
         a[8*i+2] = b[8*i+2] + c[8*i+2+1];
         a[8*i+3] = b[8*i+3] + c[8*i+3+1];
         a[8*i+4] = b[8*i+4] + c[8*i+4+1];
         a[8*i+5] = b[8*i+5] + c[8*i+5+1];
         a[8*i+6] = b[8*i+6] + c[8*i+6+1];
         a[8*i+7] = b[8*i+7] + c[8*i+7+1];
     }
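
An equivalent way to write the unrolled loops is to keep the original trip count of 64 and step the index by the unroll factor, which avoids the multiplications inside the array subscripts. The sketch below does this for a factor of 4 next to the original loop so the two can be compared. The function names and the `LEN` macro are assumptions made for illustration, and `c` is assumed to hold at least `LEN + 1` elements because the loop reads `c[i+1]`.

```c
#define LEN 64   /* trip count of the original loop */

/* Original rolled loop; c must hold LEN + 1 elements. */
void add_rolled(int *a, const int *b, const int *c)
{
    unsigned int i;

    for (i = 0; i < LEN; i++) {
        a[i] = b[i] + c[i + 1];
    }
}

/* Unrolled by four: i advances by 4, so each pass handles
 * elements i, i+1, i+2 and i+3. LEN must be a multiple of 4. */
void add_unrolled4(int *a, const int *b, const int *c)
{
    unsigned int i;

    for (i = 0; i < LEN; i += 4) {
        a[i]     = b[i]     + c[i + 1];
        a[i + 1] = b[i + 1] + c[i + 2];
        a[i + 2] = b[i + 2] + c[i + 3];
        a[i + 3] = b[i + 3] + c[i + 4];
    }
}
```

Filling `b` and `c` with test data and comparing the two output arrays element by element is a quick way to check an unrolled version against the original.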

  10. REGISTER ALLOCATION • Limit the number of local variables used in the inner loop of a function to 12 • Make sure the most important variables are used in the innermost loop; this helps the compiler decide which ones to keep in registers

  11. CALLING FUNCTIONS • Try to restrict functions to four arguments. Use structures to group related arguments and pass structure pointers instead • Define small functions in the same source file and before the functions that call them.
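
Under the ARM procedure call standard the first four word-sized arguments travel in registers r0 to r3 and any further arguments go on the stack, which is why four is the suggested limit. The sketch below contrasts a five-argument function with a version that takes a single structure pointer. The type `ScaleJob` and both function names are hypothetical, chosen only for illustration.

```c
/* Hypothetical grouping of related arguments. */
typedef struct {
    int *dst;
    const int *src;
    unsigned int count;
    int scale;
    int offset;
} ScaleJob;

/* Five arguments: on ARM the fifth (offset) is passed on the stack. */
void scale_args(int *dst, const int *src, unsigned int count,
                int scale, int offset)
{
    for (; count != 0; count--) {
        *dst++ = *src++ * scale + offset;
    }
}

/* One pointer argument: everything the function needs is reached
 * through job, which fits in a single register. */
void scale_job(const ScaleJob *job)
{
    int *dst = job->dst;
    const int *src = job->src;
    unsigned int count = job->count;

    for (; count != 0; count--) {
        *dst++ = *src++ * job->scale + job->offset;
    }
}
```

The structure version trades stack traffic at the call site for loads through the pointer inside the function, so it pays off mainly when the same group of values is passed through several calls.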

  12. REGISTER ALLOCATION • Limit the number of internal loop variables to 12 so they can be stored in registers

  13. SUMMARY • Use signed int and unsigned int types for local variables, function arguments and return values • The most efficient form of loop is the do-while loop that counts down to zero • Unroll important loops • Try to limit functions to four arguments • Avoid divisions; use multiplication by the reciprocal instead • Use the inline assembler for instructions the C compiler cannot generate
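
To make the "multiplication by reciprocal" advice concrete, here is a minimal sketch for one fixed divisor. Many ARM cores have no hardware divide instruction, so a division usually ends up as a call to a library routine, while a multiply is a single instruction. The function name `div10` is hypothetical; the constant 0xCCCCCCCD is 2^35 / 10 rounded up, and the approach assumes a 64-bit intermediate product is available.

```c
#include <stdint.h>

/* Divide an unsigned 32-bit value by 10 without a division:
 * multiply by ceil(2^35 / 10) = 0xCCCCCCCD and shift right by 35.
 * The product needs 64 bits. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Each divisor needs its own constant and shift, so this suits divisors that are known at compile time. Many compilers apply the same transformation automatically when dividing by a constant.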

  14. ARM INLINE ASSEMBLY

      #include <stdio.h>

      int main()
      {
          int n1, n2, m;

          n1 = 5;
          n2 = 3;
          __asm                      // inline assembly code
          {
              MUL m, n1, n2
          }
          printf("The result is %d\n", m);
          return 0;
      }

  15. USING INLINE ASSEMBLY • Used for ARM instructions not supported by the C compiler (coprocessor instruction set extensions) • Creates portability issues

  16. ALTERNATIVE: CALLING ASSEMBLY FUNCTION FROM C

      #include <stdio.h>

      int n1, n2, m;                 // globals, so the assembly function can import them
      extern void multip(void);      // defined in the assembly file

      int main()
      {
          n1 = 5;                    // assigning numbers
          n2 = 3;
          multip();                  // calling function: m = n1 * n2
          printf("The result is %d\n", m);
          return 0;
      }

  17. Assembly function

              AREA    example, CODE, READONLY
              EXPORT  multip          ; external function name
              IMPORT  n1              ; input (global variable defined in C)
              IMPORT  n2              ; input
              IMPORT  m               ; result variable
      multip                          ; function begins
              LDR     r3,=n1          ; r3 = address of n1
              LDR     r1,[r3]         ; r1 = n1
              LDR     r3,=n2          ; r3 = address of n2
              LDR     r2,[r3]         ; r2 = n2
              MUL     r0,r1,r2        ; r0 = n1 * n2
              LDR     r3,=m           ; r3 = address of m
              STR     r0,[r3]         ; store result to m memory location
              MOV     pc,lr           ; return from call
              END

  18. PORTABILITY ISSUES • Char type: unsigned by default on ARM, signed on many other processors • Alignment: ARM load and store instructions (LDR, STR) assume the address is a multiple of the size of the type being loaded or stored • Endianness: little endian by default, can be configured to big endian • Inline assembly: isolate inline assembly in small inlined functions so it is easy to replace when porting
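
A minimal sketch of how the char, alignment and endianness issues can be sidestepped in portable C is shown below. Both helper names are hypothetical and not from the slides. The first assembles a 32-bit value one byte at a time, so it does not depend on the address being aligned or on the byte order of the processor; the second converts a plain `char` to its 0..255 value whether `char` is signed or unsigned.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from any address.
 * Byte loads have no alignment requirement, and the shifts fix the
 * byte order regardless of how the processor is configured. */
uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Return the 0..255 value of a character, independent of whether
 * plain char is signed or unsigned on the target. */
unsigned int byte_value(char c)
{
    return (unsigned char)c;
}
```

Code that compares a plain `char` against negative values or against values above 127 is where the signedness difference usually shows up, so those comparisons are worth routing through an explicit cast like the one in `byte_value`.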

  19. EXAMPLE • Write a program that reads an 8-element row vector and an 8-element column vector from memory, multiplies both by a scalar also found in memory, and calculates the scalar product of the two scaled vectors • Assume no partial product exceeds 32 bits • Use v1 = [1 2 3 4 5 6 7 8], v2 = [0 1 2 3 4 5 6 7]T, s = 5 as test inputs • Unroll the loop by two and by four • Repeat using inline assembly for the multiplications
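
One possible starting point for the unroll-by-two variant of this exercise is sketched below in plain C. The function name `scaled_dot_unroll2` and the `VEC_LEN` macro are assumptions made for illustration, and this is only one of several reasonable ways to structure the solution. It scales both vectors in place and accumulates the scalar product in the same pass, using a do-while loop that counts down to zero.

```c
#define VEC_LEN 8   /* vector length given in the exercise */

/* Scale v1 and v2 by s in place and return their scalar product.
 * The loop is unrolled by two, so VEC_LEN must be even. */
int scaled_dot_unroll2(int *v1, int *v2, int s)
{
    unsigned int n = VEC_LEN;
    int sum = 0;

    do {
        v1[0] *= s;                 /* first element of the pair */
        v2[0] *= s;
        sum += v1[0] * v2[0];
        v1[1] *= s;                 /* second element of the pair */
        v2[1] *= s;
        sum += v1[1] * v2[1];
        v1 += 2;
        v2 += 2;
        n -= 2;
    } while (n != 0);
    return sum;
}
```

With the test inputs from the slide the function returns 4200, which is 25 times the scalar product (168) of the unscaled vectors.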
