Profile Guided Optimizations in Visual C++ 2005 Andrew Pardoe Phoenix Team (C++ Optimizer)
What do optimizers do?

    int setArray(int a, int *array) {
        int x;
        for (x = 0; x < a; ++x)
            array[x] = 0;
        return x;
    }

• The compiler knows nothing about the value of ‘a’ • The compiler knows nothing about the array’s alignment • The compiler doesn’t look at all the source files together • The compiler doesn’t know how the program will execute
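As a hand-written sketch of where profile data helps (this is an illustration, not actual compiler output): suppose a profile reveals that `a` is almost always 4. The optimizer can then emit a specialized, fully unrolled fast path guarded by a runtime check, falling back to the general loop otherwise:

```cpp
// Illustration of profile-driven specialization: the assumed
// profile says 'a' is almost always 4 (value chosen for the example).
int setArray(int a, int *array) {
    if (a == 4) {          // fast path for the profiled value: fully unrolled
        array[0] = 0;
        array[1] = 0;
        array[2] = 0;
        array[3] = 0;
        return 4;
    }
    int x;                 // general path for all other values
    for (x = 0; x < a; ++x)
        array[x] = 0;
    return x;
}
```

The guarded check costs one compare on the rare path but lets the common case run without a loop at all.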
What is PGO (pronounced PoGO)? • A “profile” details a program’s behavior in a specific scenario • Profile-guided optimizations use the profile to guide the optimizer for that given scenario • PGO tells the optimizer which areas of the application were most frequently executed • This information lets the optimizer be more selective in optimizing the program • PGO has its own set of optimizations as well as improving traditional optimizations
Example of a PGO win • Compiler optimizations make assumptions based on static analysis and standard heuristics • For example, we assume that a loop executes multiple times

    for (p = list; *p; p = p->next) {
        p->f = sqrt(F);
    }

• The optimizer would hoist the loop-invariant call to sqrt(F):

    tmp = sqrt(F);
    for (p = list; *p; p = p->next) {
        p->f = tmp;
    }

• If the profile shows that p is typically null (the loop body rarely runs), we will not hoist the call
How is PGO used?
• Instrument: source code is compiled and linked into an instrumented binary containing PGO probes
• Profile: run the instrumented binary through representative scenarios to collect a profile
• Optimize: recompile the source code using the profile to produce the optimized binary
How is PGO used? • PGO is built on top of Link-Time Code Generation • Must link object files twice: once for instrumented build, once for optimized build • Can be used on almost all native code • exe, dll, lib • COM/MFC • Windows services • Cannot be used on system or managed code • Drivers or kernel mode code • No code compiled with /CLR • Incorrect scenarios could cause worse optimizations!
PGO profile gathering • Two major themes of PGO profile gathering • Identify “hot paths” in program execution path and optimize to make these paths perform well • Likewise, identify “cold paths” to separate cold code—or dead code—from hot code • Identify “typical” values such as switch values, loop induction variables and targets of indirect calls and optimize code for these values
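The hot/cold classification above can be pictured with a toy model (the arc names and the 10% hotness threshold below are invented for illustration; the real instrumentation works on compiler-internal flow-graph arcs):

```cpp
#include <map>
#include <string>

// Toy model of arc-frequency profiling: count how often each
// branch direction executes, then classify arcs as hot or cold.
struct ArcProfile {
    std::map<std::string, long> count;

    void hit(const std::string &arc) { ++count[arc]; }

    // An arc is "hot" if it accounts for at least 10% of all
    // executions (threshold chosen arbitrarily for this sketch).
    bool isHot(const std::string &arc, long total) const {
        auto it = count.find(arc);
        long n = (it == count.end()) ? 0 : it->second;
        return n * 100 >= total * 10;
    }
};
```

Arcs that never fire at all are the "dead" code candidates for separation from the hot text.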
PGO main optimizations: inlining • Improved inlining heuristics • Inline based on frequency of call, not function size or depth of call stack • “Hot” call sites: inline aggressively • “Cold” call sites: only inline if there are other optimization opportunities (such as folding) • “Dead” call sites: only inline the trivial cases
PGO main optimizations: inlining • Speculative inlining: used for virtual call speculation • Indirect calls are profiled to find typical targets • An indirect call heavily biased toward certain target(s) can be multi-versioned • The new sequence contains direct call(s) to typical target(s), which can be inlined • Partial inlining: only inline the portions of the callee we execute. If the cold code is called, call the non-inlined function.
PGO main optimizations: code size • Choice of favoring size versus speed made on a per-function basis • Program execution should be dominated by functions optimized for speed and less-frequently used functions should be small • PGO computes a dynamic instruction count for each profiled function. • Inlining effects are taken into account. • Sorts functions in descending order by count. • Functions in the upper 99% of total dynamic instruction count are optimized for speed. Others are compressed. • In large applications (Vista, SQL) most functions are optimized for size.
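The selection rule above can be sketched as follows (function names and counts are invented; this is the idea, not the compiler's implementation): sort functions by dynamic instruction count, then mark for speed only those needed to cover the top 99% of the total:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Each pair: function name, profiled dynamic instruction count.
typedef std::pair<std::string, long long> FuncCount;

// Returns the functions to optimize for speed: those in the upper
// 'percent' of total dynamic instruction count, hottest first.
// Everything else would be optimized for size.
std::vector<std::string> speedSet(std::vector<FuncCount> funcs, double percent) {
    std::sort(funcs.begin(), funcs.end(),
              [](const FuncCount &a, const FuncCount &b) {
                  return a.second > b.second;   // descending by count
              });
    long long total = 0;
    for (const FuncCount &f : funcs) total += f.second;
    std::vector<std::string> result;
    long long running = 0;
    for (const FuncCount &f : funcs) {
        if (running >= total * percent / 100.0) break;  // target covered
        result.push_back(f.first);
        running += f.second;
    }
    return result;
}
```

With a profile dominated by one hot function, nearly everything else falls below the cutoff, which is why large applications end up with most functions compiled for size.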
PGO main optimizations: locality • Reorder the code to “fall through” wherever possible • Intra-function layout reorders basic blocks so that the major trace falls through whenever possible. • Inter-function layout tries to place frequent caller-callee pairs near one another in the image. • Extract “dead” code from the .text section and put it in a remote section of the image • Dead code can be entire functions that are not called or basic blocks inside a function • Penalty for being wrong is very large so the profile must be accurate!
What code benefits most? • C++ programs: many virtual calls can be inlined once the target is determined through profiling • Large applications where size and speed are important • Code with frequent branches that are difficult to predict at compile time • Code which can be separated by profiling into “hot” and “cold” blocks to help instruction cache locality • Code for which you know the typical usage patterns and can produce accurate profiling scenarios
Scenario 1 • Customer compiles with /O2 and gets pretty good performance but wants to take advantage of advanced optimizations like LTCG and PGO • Code is tested by the dev team throughout the development cycle using unit and bug regression tests • Customer has done some ad-hoc performance measurements and believes performance can improve, but has no automated tests to measure it • Is this customer ready to try PGO? Probably not.
Scenario 2 • Customer has well-defined performance goals and tests set up to measure performance • Customer knows typical usage patterns for the application • Application is being built with LTCG • Most of the execution time is spent in tightly-nested loops doing heavy floating-point calculations • Is this customer ready to use PGO? Maybe…
Scenario 3 • Customer has well-defined performance goals and tests set up to measure performance • Customer knows typical usage patterns for the application • Application is being built with LTCG • Application spends most of its time in branches and calls • Application is fairly large and makes use of inheritance • Is this customer ready to use PGO? Definitely.
Scenario 4 • Customer has a build lab and wants to enable PGO in nightly builds • But profiling every night seems too expensive • Solution: PGO Incremental Update • Avoid running profile scenarios at every build • PGU uses “stale” profile data • Can check in profile data and refresh weekly • PGU restricts optimizations • Functions which have changed will not be optimized • Effects of localized changes are usually negligible
PGO sweeper • Some scenarios are difficult to collect profile data for • Profile scenario may not begin and end with application launch and shutdown • Some components cannot write a file • Some components cannot link to the PGO runtime DLL • PGO sweeper collects profile data from running instrumented processes • This allows you to close a currently open .pgc file and create a new one without exiting the instrumented binary • You get one .pgc file per run or sweep. You can delete any .pgc files you do not want reflected in your scenario.
PGO Manager • PGO manager adds profile data from one or more .pgc files into the .pgd file • The .pgd file is the main profile database • Allows you to profile multiple scenarios (.pgc) for a single codebase into one profile database (.pgd) • PGO manager also lets you generate reports from the .pgd file to see that your scenarios “feel right” in the code • Information in the reports include • Module count, function count, arc and value count • Static (all) instruction count, dynamic (hot) instruction count • Basic block count, average basic block size • Function entry count
How much performance does PGO get? • Performance gain is architecture and application specific • IA64 sees the biggest gains • x64 benefits more than x86 • Large applications benefit more than small: SQL Server saw over 30% gains through PGO • Many parts of Windows use PGO to balance size vs. speed • If you understand your real-world scenarios and have adequate, repeatable tests, PGO is almost always a win • Once your testing is in place, integrating PGO into your build process should be easy
Call-graph profiling
• Given this call graph, determine which code paths are hot and which are cold
[diagram: call graph with nodes a, foo, bar, baz and bat]
Call-graph profiling continued
• Measure the frequency of calls
[diagram: the same call graph (a, foo, bar, baz, bat) annotated with a call count on each arc]
Call-graph profiling after inlining
• Inline functions based on call profile
• Highest-frequency calls are (bar, baz) and (bat, bar)
[diagram: the call graph after inlining baz into bar and bar into bat]
Reordering basic blocks
• Change code layout to improve instruction cache locality
• Execution profile: the path through blocks A, C and D is hot (count 100); block B is cold (count 10)
• Default layout: A, B, C, D — the hot path must branch around B
• Optimized layout: A, C, D, B — the hot path falls through, and cold block B is moved last
Speculative inlining of virtual calls
• Profiling shows the dynamic type of object A in function Func was almost always Foo (and almost never Bar)

    class Base { … virtual void call(); };
    class Foo : public Base { … void call(); };
    class Bar : public Base { … void call(); };

    // Before PGO: every iteration pays for a virtual dispatch
    void Func(Base *A) {
        …
        while (true) {
            …
            A->call();        // virtual dispatch
            …
        }
    }

    // After PGO: guarded direct call to the typical target
    void Func(Base *A) {
        …
        while (true) {
            …
            if (type(A) == Foo) {   // cheap dynamic-type check
                // inlined body of Foo::call()
            } else {
                A->call();    // virtual dispatch fallback
            }
            …
        }
    }
Partial inlining
• Profiling shows that condition Cond favors the left (hot) branch over the right (cold) branch
[diagram: Basic Block 1 → Cond → Hot Code | Cold Code → More Code]
Partial inlining concluded
• We can inline the hot path, and not the cold path
• We can make different decisions at each call site!
[diagram: at the call site, Basic Block 1 → Cond → inlined Hot Code | out-of-line call to Cold Code → More Code]
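In source-level terms, the effect of partial inlining looks roughly like this (function names are invented; the real transformation happens on compiler IR, not source):

```cpp
// Cold remainder of the callee, kept out of line.
int lookupSlow(int key) {
    // ...expensive fallback path, rarely executed...
    return -key;
}

// The callee: a hot early-out plus a cold fallback.
int lookup(int key) {
    if (key >= 0)              // hot: profile says almost always taken
        return key * 2;
    return lookupSlow(key);    // cold
}

// A caller after "partial inlining": only the hot portion of
// lookup() is expanded inline; the cold path still pays for a call.
int callerAfterPGO(int key) {
    int v;
    if (key >= 0)
        v = key * 2;           // inlined hot body of lookup()
    else
        v = lookupSlow(key);   // out-of-line cold path
    return v + 1;
}
```

The caller gets the speed of full inlining on the hot path without pulling the cold code into its instruction stream.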
Using PGO (in more detail)
• Compile the source code with /GL (plus your usual optimization switches) to produce object files
• Link with /LTCG:PGI to produce the instrumented binary and a .PGD file
• Run the instrumented binary through your scenarios, producing one or more .PGC files
• Link the object files with /LTCG:PGO, which merges the .PGC files into the .PGD file and produces the optimized binary
PGO tips • The scenarios used to generate the profile data should be real-world scenarios. The scenarios are NOT an attempt to do code coverage. • Training with scenarios that are not representative of real-world use can result in code that performs worse than if PGO were not used. • Name the optimized binary something different from the instrumented binary, for example, app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without rerunning everything. • To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
PGO tips • If you have two scenarios that run for different amounts of time, but would like them to be weighted equally, you can use the weight switch (/merge:weight in pgomgr) on .PGC files to adjust them. • You can use the speed switch to change the speed/size thresholds. • You can control the inlining threshold with a switch but use it with care. The values from 0-100 aren't linear. • Integrate PGO into your build process and update scenarios frequently for the most consistent results and best performance increases.
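Weighting a scenario amounts to scaling its counts before they are merged into the profile database. A toy sketch of the arithmetic (the data model here is invented; it is not pgomgr's actual file format):

```cpp
#include <map>
#include <string>

typedef std::map<std::string, long long> Counts;   // arc name -> hit count

// Merge one scenario's counts into the running total, scaled by an
// integer weight -- the same idea as weighting a .PGC file so a
// short run counts as much as a long one.
void mergeWeighted(Counts &total, const Counts &scenario, long long weight) {
    for (const auto &kv : scenario)
        total[kv.first] += kv.second * weight;
}
```

For example, a scenario that ran a tenth as long can be merged with weight 10 so both scenarios influence the optimizer equally.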
In summary • Using PGO is very easy, with four simple steps • CL to parse the source files • cl /c /O2 /GL *.cpp • LINK / PGI to generate instrumented image • link /ltcg:pgi /pgd:appname.pgd *.obj *.lib • Also generates a PGD file (PGO database) • Run your program on representative scenarios • Generates PGC files (PGO profile data) • LINK / PGO to generate optimized image • Implicitly uses the generated PGC files • link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
More information • Matt Pietrek’s Under the Hood column from May 2002 has a fantastic explanation of LTCG internals • Multiple articles on PGO located on MSDN • The links are long: just search for PGO on MSDN • Look through articles by Kang Su Gatlin on his blog at http://blogs.msdn.com/kangsu or on MSDN • Improvements are coming in the new VC++ backend • Based on the Phoenix optimization framework • Profiling is a major scenario for the Phoenix-based optimizer • There will be a talk on Phoenix later today