Profile-driven Inlining for Erlang
Thomas Lindgren
thomasl_erlang@yahoo.com
Inlining
• Replace function call f(X1,…,Xn) with body of f/n
• Optimization enabler
  • Simplify code
  • Specialize code
  • Remove "optimization fence"
• Standard tool in modern compiler toolbox
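As a minimal illustration of the definition above, the sketch below shows a call before and after inlining by hand. The module and the functions square/1, area_call/1 and area_inlined/1 are hypothetical names chosen for this example; they do not come from the talk.

```erlang
-module(inline_demo).
-export([area_call/1, area_inlined/1]).

%% Hypothetical helper: small, so a plausible inlining candidate.
square(X) -> X * X.

%% Before inlining: the caller invokes square/1.
area_call(R) -> 3 * square(R).

%% After inlining by hand: the call is replaced by the body of square/1,
%% with the formal parameter X replaced by the actual argument R.
area_inlined(R) -> 3 * (R * R).
```

Both versions compute the same result; the inlined one exposes the multiplication to further simplification in the caller, which is the "optimization enabler" point.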
Inlining
• Main problem: which calls to inline?
  • Code growth reduces performance
  • Estimate code size growth
  • Select the best estimated sites subject to cost
• Some static estimations:
  • f/n is small? (= inline cost is small)
  • Inlining the call to f/n enables optimization
• Are we optimizing the important code?
  • Or just the convenient code?
Inlining
• Dynamic estimation
  • Profile the program
  • Select the best hot call sites for inlining
  • Optimize the important code
Our approach
• Inlining driven by profiling
• Permit cross-module inlining
  • Computations often span several modules
  • Code growth measured for whole program
• Cross-module optimization enabled by (i) module aggregation and (ii) guarded conversion of remote to local calls
  • (will not describe this further here)
  • [Lindgren 98]
The rest of this talk
• Overview of method
• Performance measurements
Inline forest
• Inlinings to be done represented by forest
• Nodes are inlined call sites
• Leaves are call sites to be checked
• (Example shows nested inlining)
[Figure: inline forest over call sites f, g and h; some sites are not inlined]
Priority-based inlining
• All call sites (leaves in inline forest) are placed in priority queue
• Priority = estimated number of calls
• When a call site f is inlined, the call sites in f are added to the queue
  • Priority scaled appropriately
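The priority queue described above could be sketched as follows. This is one possible representation, assuming an ordered set of {Priority, CallSite} pairs from the standard gb_sets module; the talk does not say which data structure was used, and the module and function names here are hypothetical.

```erlang
-module(prio_demo).
-export([new/0, push/3, push_scaled/3, pop/1]).

%% The queue is an ordered set of {Priority, CallSite} pairs.
%% Note: identical pairs collapse, so call sites need distinct identifiers.
new() -> gb_sets:new().

push(Prio, Site, Q) -> gb_sets:add_element({Prio, Site}, Q).

%% When a site visited K times is inlined, each call site G inside it
%% enters the queue with priority K * Ratio ("priority scaled").
push_scaled(K, SitesInCallee, Q) ->
    lists:foldl(fun({G, Ratio}, Acc) -> push(K * Ratio, G, Acc) end,
                Q, SitesInCallee).

%% Take the call site with the highest estimated number of calls.
pop(Q) ->
    case gb_sets:is_empty(Q) of
        true -> empty;
        false ->
            {{Prio, Site}, Q1} = gb_sets:take_largest(Q),
            {Prio, Site, Q1}
    end.
```

With the figures from the decode1 example, a site with priority 800k is popped before one with 200k, and inlining the 200k site adds its inner call site with priority of roughly 198k (200k scaled by 0.99).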
Inlining algorithm
• Preprocess code
  • call_site and size maps
• Initialize priority queue
• Initialize inline forest
• While prio queue not empty
  • Take call site (k, f)
  • Try to inline it
Preprocessing
• for each function visited k times
  • for each call site visited k’ times
    • set ratio(call_site) = (k’/k)
• Adjust ratio so that it is < 1.0
• Self-recursive call sites := 0.0
  • (improves code quality)
• Build maps (function -> [{call_site, ratio}])
• dec_bearer_capability/2 runs 200,000 times
• its call site to dec_bearer_capability_6 is visited 200,000 times
• ratio is (200,000/200,000) = 1.0
• adjust ratio to 0.99
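The ratio computation from the preprocessing step can be sketched as below. The slides only say that the ratio is adjusted to stay below 1.0 and show 1.0 becoming 0.99; capping at 0.99 with min/2 is an assumption of this sketch, as are the module and function names.

```erlang
-module(ratio_demo).
-export([site_ratio/4]).

%% site_ratio(Caller, Callee, SiteVisits, FunVisits)
%% SiteVisits is k' (visits to the call site), FunVisits is k (visits
%% to the enclosing function).

%% Self-recursive call sites get ratio 0.0, so they are never inlined.
site_ratio(F, F, _SiteVisits, _FunVisits) ->
    0.0;
%% Otherwise k'/k, capped at 0.99 (assumed adjustment rule).
site_ratio(_Caller, _Callee, SiteVisits, FunVisits) when FunVisits > 0 ->
    min(SiteVisits / FunVisits, 0.99).
```

Keeping every ratio below 1.0 means priorities shrink along any chain of nested inlinings, which helps the later queue-driven loop run out of candidates.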
Inlining a call site
• Bookkeeping phase (code gen later)
• Call to f(X1,…,Xn), visited k times
  • k < minimum frequency? stop
  • tot_size + size(f) > max_size? skip
  • Otherwise,
    • tot_size += size(f)
    • for each call site g of f
      • add (k * ratio, g) to priority queue
    • extend node f by call sites g1,…,gn
• Iterate until no call sites remain
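The bookkeeping loop above can be sketched roughly as follows. This is a simplified reading of the slides, not the author's implementation: call sites are identified by the path of callee names from the innermost site up to the root (an assumption that stands in for the inline forest), the queue is a gb_sets ordered set, and plan/5 with its argument layout is hypothetical.

```erlang
-module(inline_plan).
-export([plan/5]).

%% plan(Queue0, Sizes, Sites, MinFreq, MaxSize) -> {InlinedSites, TotSize}
%%   Queue0 : [{K, [F]}]          initial call sites with visit counts
%%   Sizes  : #{F => Size}        size map
%%   Sites  : #{F => [{G, Ratio}]} call_site map from preprocessing
plan(Queue0, Sizes, Sites, MinFreq, MaxSize) ->
    loop(gb_sets:from_list(Queue0), Sizes, Sites, MinFreq, MaxSize, 0, []).

loop(Q0, Sizes, Sites, MinFreq, MaxSize, Tot, Acc) ->
    case gb_sets:is_empty(Q0) of
        true ->
            {lists:reverse(Acc), Tot};
        false ->
            {{K, [F | _] = Path}, Q1} = gb_sets:take_largest(Q0),
            Size = maps:get(F, Sizes),
            if
                K < MinFreq ->
                    %% Hottest remaining site is too cold: stop.
                    {lists:reverse(Acc), Tot};
                Tot + Size > MaxSize ->
                    %% Too much code growth: skip this site only.
                    loop(Q1, Sizes, Sites, MinFreq, MaxSize, Tot, Acc);
                true ->
                    %% Inline: queue the call sites of f with scaled priority.
                    New = [{K * R, [G | Path]} || {G, R} <- maps:get(F, Sites, [])],
                    Q2 = lists:foldl(fun gb_sets:add_element/2, Q1, New),
                    loop(Q2, Sizes, Sites, MinFreq, MaxSize,
                         Tot + Size, [{Path, K} | Acc])
            end
    end.
```

The returned paths describe nested inlinings: a path [g, f] means g was inlined inside an inlined copy of f. Because self-recursive sites have ratio 0.0 and all other ratios are below 1.0, priorities eventually drop under the minimum frequency when that threshold is positive.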
Example
• Inlining applied to decode1
  • Protocol decoding
  • Single module
decode1
Prio queue:
• decode_ie_coding_1/3 [800k]
• decode_action/1 [800k]
• dec_bearer_capability/2 [200k]
• dec_bearer_capability_6/2 [198k]
• decode_ie_heads_setup/5 [198k]
• …
Inline forest: [figure not preserved]
Call_site mapping (selected parts):
• dec_bearer_capability/2 -> [(dec_bearer_capability_6, 1.00)] (adjust to 0.99)
• decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.2), (decode_ie_heads_setup/5, 0.6)] (self-recursive so set to 0.0)
• …
decode1
Prio queue:
• decode_ie_coding_1/3 [800k] (try to inline)
• decode_action/1 [800k]
• dec_bearer_capability/2 [200k]
• dec_bearer_capability_6/2 [198k]
• decode_ie_heads_setup/5 [198k]
• …
Inline forest: [figure not preserved]
Call_site mapping:
• dec_bearer_capability/2 -> [(dec_bearer_capability_6, 0.99)]
• decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.0), (decode_ie_heads_setup/5, 0.0)]
• …
decode1
Prio queue:
• -
• decode_action/1 [800k]
• dec_bearer_capability/2 [200k]
• dec_bearer_capability_6/2 [198k]
• decode_ie_heads_setup/5 [198k]
• …
Inline forest: [figure not preserved]
decode1
Prio queue:
• -
• -
• dec_bearer_capability/2 [200k]
• dec_bearer_capability_6/2 [198k]
• decode_ie_heads_setup/5 [198k]
• …
Inline forest: [figure not preserved]
decode1
• Final result:
  • Inline dec_bearer_cap_6/2 into dec_bearer_cap/2, yielding (*)
  • Inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5
• During inlining, one inline was rejected for too much code growth (not shown)
• Now time for code generation
Code generation
• Walk each inline tree from leaf to root
• Replace inlined calls f(E1,…,En) with
  • (fun(X1,…,Xn) -> E end)(E1,…,En)
  • General case: nested inlines
• Simplify the resulting function
  • Apply fun to arguments (above)
  • Case-of-case
  • Case-of-if
  • …
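The rewrite described above can be followed on a small hand-written example. The functions below are hypothetical and only illustrate the first simplification, applying the fun to its arguments; the case-of-case and case-of-if steps are not shown.

```erlang
-module(codegen_demo).
-export([original/1, fun_applied/1, simplified/1]).

%% Hypothetical callee selected for inlining.
scale(X, F) -> X * F.

%% Original code: an ordinary call.
original(Y) -> scale(Y + 1, 3).

%% Step 1: the call f(E1,E2) becomes (fun(X1,X2) -> E end)(E1,E2).
fun_applied(Y) -> (fun(X, F) -> X * F end)(Y + 1, 3).

%% Step 2: apply the fun to its arguments. The non-trivial argument is
%% bound once to a variable; the constant 3 is substituted directly.
simplified(Y) ->
    X = Y + 1,
    X * 3.
```

All three functions return the same value for the same input. In a real transformation the fun's parameters would need names that do not clash with variables already bound in the caller, which is presumably where a name capture bug could arise.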
Measurements
• Used five applications
  • decode1 (small protocol decoder)
  • ldapv2 (ASN.1 encode/decode)
  • gen_tcp (send/rcv over socket)
  • beam (compiler)
  • mnesia (simulate HLR)
Benchmarks
Performance
• Very preliminary
• Code generation problems for beam and mnesia => unable to measure
  • (Probably due to name capture bug)
• Did not use outlining, higher-order specialization, apply open-coding [EUC’01]
• Tried only emulated code
  • Native code compilation failed
Speedup vs baseline
[Chart not preserved]
• Native compilation of inlined decode1 provided a net slowdown
Future work
• Integrate with other optimizations
  • Plenty of opportunities for further source-level simplifications
• Suggests new approach to module aggregation
  • (do it after inlining instead of before)
• Tuning, measurements
• Bugfixing …
Conclusion
• Profile-guided inlining speeds up real code
• Whole-program, cross-module inlining probably necessary
Module merging
• We want to optimize over several modules at a time
• What to do about hot code loading?
• Merge modules to aggregates
  • Convert suitable remote calls into local calls
  • Guard such calls to preserve code loading semantics
  • Annotate code regions with "origin module" to enable precise process purging
• Or … extend Erlang appropriately