250 likes | 354 Views
Automatic Application Profiling. Lecture 22. Today – What parts of the code are slow?. Amdahl’s law How to get the processor to tell us what’s taking the most time – Statistical program counter sampling. Amdahl’s Law. Gene Amdahl - “Optimize the common case”
E N D
Automatic Application Profiling Lecture 22
Today – What parts of the code are slow? • Amdahl’s law • How to get the processor to tell us what’s taking the most time – Statistical program counter sampling
Amdahl’s Law • Gene Amdahl - “Optimize the common case” • Double the speed of ¼ of the program: • Quadruple the speed of ¼ of the program:
Work on the Slow Part • Double the speed of ¾ of the program: • Quadruple the speed of ¾ of the program:
How Do We Find the Slow Parts • Option A: Measure the amount of time each region takes to execute • Code unsigned long times[NUM_FUNCTIONS]; void my_function(int i) { t_start = get_ticks(); /* function body */ times[THIS_FUNCTION_NUM] += get_ticks() - t_start; } • Pros • Exact. Can get single cycle accuracy if needed. • Cons • Tedious. Must add code to each function to be monitored. • Need access to source code, which may be a problem for library functions.
Program Counter Sampling • Option B: Periodically examine the PC to see what’s running • Result shows average fraction of time spent executing a region of code • Supporting data structure: table of region information • Starting and ending addresses: defined before sampling • Execution counts:updated during sampling • Sampling • Use a timer to interrupt application periodically • Within ISR • Read PC off of stack • Examine table of region addresses to determine currently executing region N • Increment entry N of execution count table • Also increment a total number of ticks variable (to reveal out-of-range PC instances) • At end • Provide execution count table to user (via file, serial port, debugger, etc.) typedef struct { char Name[PROFILE_NAME_SIZE]; unsigned long Start, End; unsigned Count; } PROFILE_T;
Configure Timer to Generate Periodic Interrupt • Call this function with desired sampling frequency in Hz void Init_Profiling(unsigned samp_freq) { unsigned long divider; // set up timer A0 to interrupt at samp_freq ta0mr = 0x00; divider = ((unsigned long)MAIN_CLOCK)/samp_freq; if (divider > 0x0ffffl) ta0 = 0x0ffff; else ta0 = (unsigned) (divider & 0x0ffff); DISABLE_IRQ; ta0ic = 1; ENABLE_IRQ; ta0s = 1; }
Profiler Interrupt Service Routine #pragma INTERRUPT/B profile_intr void profile_intr(void) { unsigned char PC_H; unsigned int PC_ML; unsigned long PC; unsigned char i; /* Get PC from stack */ _asm("mov.w 2[FB], $$[FB]", PC_ML); _asm("mov.b 5[FB], $$[FB]", PC_H); PC = PC_H; PC <<= 16; PC += PC_ML; profile_ticks++; /* look up function in table and inc. counter */ for (i=0; i<NUM_PROFILE_REGIONS; i++) { if ((PC >= profiles[i].Start) && (PC <= profiles[i].End)) { profiles[i].Count++; return; } } } • Don’t forget to register ISR in vector table!
Configure Profile Table with Region Information To profile modules (source files) • Where do we find region addresses? Next page. PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0xf0380, UL 0xf0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, }; To profile functions PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main", UL main, UL (main+0x0199), 0}, {"LCD_Erase", UL LCD_Erase_FB, UL (LCD_Erase_FB+0x0240), 0}, {"sin", UL sin, UL (sin+0x0138), 0}, {"LCD_Plot_in_FB", UL LCD_Plot_in_FB, UL (LCD_Plot_in_FB+0x02e3), 0}, {"LCD_Display_FB", UL LCD_Display_FB, UL (LCD_Display_FB+0x0302), 0}, {"DisplayDelay", UL DisplayDelay, UL (DisplayDelay+0x01ad), 0}, {"LCD_write", UL LCD_write, UL (LCD_write+0x017f), 0}, {"", UL 0, UL 0, 0} };
Finding the Region Addresses • Get module addresses from linker’s map file (in debug directory) • Get function lengths (if needed) from .LST file • Second column is start address of each assembly instruction • Subtract function’s first address from its last address to find length program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD 506 ;## # FUNCTION LCD_Erase_FB 507 ;## # FRAME AUTO (y) size 2, offset -4 508 ;## # FRAME AUTO (x) size 2, offset -2 509 ;## # ARG Size(0) Auto Size(4) Context Size(5) 510 511 .align 512 ;## # C_SRC : void LCD_Erase_FB(void) { 513 .glb _LCD_Erase_FB 514 00212 _LCD_Erase_FB: 515 00212 7CF204 enter #04H 516 ;## # C_SRC : for (x=0; x<8; x++) 517 00215 D90BFE mov.w #0000H,-2[FB] ; x P (etc.) 539 ;## # C_SRC : } 540 0023F 7DF2 exitd 541 00241 E7:
Overview of Profiling Approach • Start with the big picture (rough details) and use that to determine where to look next • Profiling sequence • module-level (file-level) • function-level within the most common module • basic block-level within the most common function
Detailed Steps to using profile.c/h • Enable list file creation for each C source file • HEW: Options -> Renesas M16C Standard Toolchain • C Tab -> Category: List. Check –dS and –dSL boxes • Fill in array profiles with region addresses (e.g. names of functions), dummy lengths, and zero counts • Compile • May need to add function prototypes if profiles table is declared before the functions are • Update profiles array with correct starting and ending addresses • Recompile • Run • Examine profiles after running long enough
How Long is Enough? • Complex statistical question • The statistician I asked said “it depends” and changed the subject • So, run it until the digits you care about stop changing • Example: Module-level profiling
Where does the Lab 4 Skeleton Spend its Time? DisplayString(LCD_LINE1," Lab #4 "); DisplayString(LCD_LINE2," Starter"); GRN_LED = LED_ON; while (1) { for (f=6.0; f>0.0; f -= 0.4) { LCD_Erase_FB(); for (i=0; i<DISP_WIDTH_PIXELS; i++) LCD_Plot_in_FB((unsigned char)i, (unsigned char) (3.5*(sin(i/f)+1.0)), 1); LCD_Display_FB(LCD_LINE1); } } We know there are delay loops executed every time the MCU writes to the LCD, but let’s verify how bad they are • Start with modules • Then look at functions in module • Then look at basic blocks within function
Step One: Profile Modules # SECTION ATR TYPE START LENGTH ALIGN MODULENAME program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8 • Define a profile region per module, and one for all the library functions
Step One: Profile Modules PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0x0f0380, UL 0x0f0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, };
Step 1 Results • Surprise! The LCD functions aren’t taking up most of the processor’s time! The library functions are instead
Step Two: Profile Library # SECTION ATR TYPE START LENGTH ALIGN MODULENAME program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8 • We only have eight entries in our table, so let’s split up the library into eight regions of about three functions each lib 1 lib 2 lib 3 lib 4 lib 5 lib 6 lib 7 lib 8
Step Two: Profile Library PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"lib 1", UL 0x0f0918, UL 0x0f161b, 0}, {"lib 2", UL 0x0f161c, UL 0x0f1a0b, 0}, {"lib 3", UL 0x0f1a0c, UL 0x0f1c7d, 0}, {"lib 4", UL 0x0f1c7e, UL 0x0f1deb, 0}, {"lib 5", UL 0x0f1dec, UL 0x0f22d9, 0}, {"lib 6", UL 0x0f22da, UL 0x0f2699, 0}, {"lib 7", UL 0x0f269a, UL 0x0f2823, 0}, {"lib 8", UL 0x0f2824, UL 0x0f290d, 0} };
Step Two Results • Functions in group lib 6 are taking the most time, followed by lib 4 and lib 1
Step Three: Profile Top Library Functions # SECTION ATR TYPE START LENGTH ALIGN MODULENAME program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8 • Examine the nine functions in these three groups, grouping two functions together
Step Three: Profile Top Library Functions PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"_F4DIV", UL 0x0f0918, UL 0x0f0b27, 0}, {"_F4TOF8+_F8ADD", UL 0x0f0b28, UL 0x0f161b, 0}, {"_I4DIVU", UL 0x0f1c7e, UL 0x0f1ccb, 0}, {"_I4TOF4", UL 0x0f1ccc, UL 0x0f1ced, 0}, {"_LTOF", UL 0x0f1cee, UL 0x0f1deb, 0}, {"_F4RTOL", UL 0x0f22da, UL 0x0f2339, 0}, {"_F8DIV", UL 0x0f233a, UL 0x0f261d, 0}, {"_F8EQ", UL 0x0f261e, UL 0x0f2699, 0} };
Step Three Results • Most time spent in double precision floating point divide • 3.5*(sin(i/f)+1.0) is culprit. Avoid floating point when possible
Disadvantages of Sampling • Sampling is inexact - not guaranteed to get everything that runs • Code which disables interrupts (e.g. ISRs, OS code) is not measured • Rarely executed code may be missed • Takes time for numbers to settle down • Profile changes based on mode of program • If manually creating table, user needs to update address table with each code change
Implementing Instrumentation • Tedious to do manually for large programs, so automate • Have compiler instrument code for you • gcc and other compilers support profiling using a command line switch • They provide a tool to process the output file to determine how much time each function takes • Can also modify the binary (after compilation): Atom from DEC’s Western Research Lab, Etch, EEL • Tool processes binary files to run your instrumentation procedures for each procedure, basic block, or instruction • For the M16C, what would be best? • Create a program which reads the map file and creates a C file declaring our profiles array with correct region names and addresses • Probably easiest to use a scripting language: sed, awk, perl, (f)lex • Probably not enough memory to instrument all functions, so must be selective