A study on implementing dynamic code layout using activation order to improve instruction reference locality, eliminating the need for profile information required by current solutions.
Improving Instruction Locality with Just-In-Time Code Layout. J. Bradley Chen and Bradley D. D. Leupen, Division of Engineering and Applied Sciences, Harvard University.
Goals: Improve instruction reference locality, a big problem for commodity applications. Eliminate the need for the profile information required by current compiler-based solutions.
How? Implement layout dynamically, using Activation Order, a new heuristic for code layout: locate procedures in order of use.
Requirements: No special hardware support. Minimal changes to the operating system. Minimal system overhead.
Optimizing Procedure Layout (figure comparing a bad layout with a better layout).
Current Practice: Pettis and Hansen. Nodes are procedures. Edges are caller/callee pairs. Weights are call frequency.

Caller                  Callee                  Calls
WinMain()               Initialize()            1
WinMain()               EventLoop()             1
EventLoop()             GetEvent()              129394
EventLoop()             React()                 68754
GetEvent()              CheckForInputError()    128404
React()                 HandleRareCase()        1
React()                 HandleCommonCase()      68753
CheckForInputError()    HandleInputError()      10
Pettis and Hansen Layout (figure: call-graph nodes are merged step by step, and the layout grows with each merge).

layout: []
layout: [GetEvent, CheckForInputError]
layout: [EventLoop, GetEvent, CheckForInputError]
layout: [React, EventLoop, GetEvent, CheckForInputError]
layout: [HandleCommonCase, React, EventLoop, GetEvent, CheckForInputError]
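The merging idea can be sketched in a few lines of Python. This is a simplified greedy chain merge in the spirit of Pettis and Hansen, not the published algorithm: the orientation rule used here (place the two procedures of the edge being merged as close together as possible) is an assumption made for brevity, and the function name is hypothetical. Only the four heavy edges from the call graph are used, and the order of intermediate merges may differ from the steps shown on the slide.

```python
def greedy_chain_layout(edges):
    """Merge procedure chains along call-graph edges, heaviest edge first.

    edges: list of (caller, callee, call_count) tuples.
    Simplified orientation rule (an assumption, not the published one):
    of the four ways to join two chains, keep the one that puts the two
    procedures of the current edge closest together.
    """
    chains = {}  # procedure name -> the chain (list) it currently belongs to
    for a, b, _ in edges:
        chains.setdefault(a, [a])
        chains.setdefault(b, [b])

    for a, b, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        ca, cb = chains[a], chains[b]
        if ca is cb:
            continue  # already in the same chain
        candidates = [ca + cb, ca[::-1] + cb, ca + cb[::-1], ca[::-1] + cb[::-1]]
        merged = min(candidates, key=lambda c: abs(c.index(a) - c.index(b)))
        for proc in merged:
            chains[proc] = merged

    # Concatenate the remaining chains, each one exactly once.
    layout, seen = [], set()
    for chain in chains.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


hot_edges = [
    ("EventLoop", "GetEvent", 129394),
    ("EventLoop", "React", 68754),
    ("GetEvent", "CheckForInputError", 128404),
    ("React", "HandleCommonCase", 68753),
]
layout = greedy_chain_layout(hot_edges)
print(layout)
```

For these four edges the sketch produces the mirror image of the final layout on the slide, which yields the same adjacency between procedures. Note that the edge weights have to come from a profiling run, which is the dependency JITCL sets out to remove.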
A New Heuristic: Activation Order. Co-locate procedures that are activated sequentially. (Example figure.)
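A minimal sketch of the Activation Order idea, assuming a recorded sequence of procedure activations and made-up procedure sizes: each procedure is placed at the next free address the first time it is activated. The function name, the trace, and the sizes are illustrative assumptions and do not come from the talk.

```python
def activation_order_layout(activations, sizes):
    """Assign addresses to procedures in order of first activation.

    activations: iterable of procedure names in the order they are called.
    sizes: procedure name -> size in bytes (hypothetical values here).
    Returns a dict mapping each activated procedure to its start address.
    """
    addresses = {}
    next_free = 0
    for proc in activations:
        if proc not in addresses:      # first activation: place it now
            addresses[proc] = next_free
            next_free += sizes[proc]
    return addresses


sizes = {
    "WinMain": 128, "Initialize": 256, "EventLoop": 64, "GetEvent": 96,
    "CheckForInputError": 32, "React": 64, "HandleCommonCase": 160,
    "HandleRareCase": 512, "HandleInputError": 200,
}
trace = [
    "WinMain", "Initialize", "EventLoop",
    "GetEvent", "CheckForInputError", "React", "HandleCommonCase",
    "GetEvent", "CheckForInputError", "React", "HandleCommonCase",
]
addresses = activation_order_layout(trace, sizes)
for name, addr in addresses.items():
    print(f"{addr:5d}  {name}")
```

Procedures that are never activated in this trace (HandleRareCase and HandleInputError) receive no address at all, so they take up no space in the laid-out text.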
Implementing JITCL

__start:
    perform initializations
    call thunk_main
thunk_main:
    . . .
thunk_foo:
    . . .
__InstructionMemory:

Thunk routines implement code layout on-the-fly.
Thunk routines

// Global variables:
//   ProcPointers[] - one element per procedure
//   INDEX_proc and LENGTH_proc for each procedure
thunk_main:
    // Copy the procedure on first use, while it still lives in the original code segment.
    if (InCodeSegment(ProcPointers[INDEX_main]))
        ProcPointers[INDEX_main] = CopyToTextSegment(ProcPointers[INDEX_main], LENGTH_main);
    // Redirect the calling site so that later calls bypass the thunk.
    PatchCallSite(ProcPointers[INDEX_main], ComputeCallSiteFromReturnAddress(RA));
    jmp ProcPointers[INDEX_main];

The thunk routines copy procedures into the text segment and update call sites at run-time.
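The thunk behaviour can be modelled at a high level in Python. This is a toy model under stated assumptions, not the JITCL implementation: addresses are plain integers, each call site is identified by a string and always calls the same procedure, and the class and attribute names are hypothetical.

```python
class JitLayoutModel:
    """Toy model of thunk-driven layout; addresses are plain integers."""

    def __init__(self, lengths):
        self.lengths = dict(lengths)   # procedure name -> size in bytes
        self.text_addr = {}            # procedure name -> address after copying
        self.next_free = 0             # next free byte in the new text region
        self.patched = {}              # call site id -> patched target address
        self.copies = 0                # number of procedure copies performed
        self.thunk_entries = 0         # number of calls that went through a thunk

    def call(self, site, proc):
        """Resolve a call from `site` to `proc`; return the target address."""
        if site not in self.patched:
            # Unpatched call site: control enters the thunk for `proc`.
            self.thunk_entries += 1
            if proc not in self.text_addr:
                # First activation: copy the procedure to the next free address.
                self.text_addr[proc] = self.next_free
                self.next_free += self.lengths[proc]
                self.copies += 1
            # Patch the call site so later calls jump straight to the copy.
            self.patched[site] = self.text_addr[proc]
        return self.patched[site]


model = JitLayoutModel({"main": 100, "foo": 40})
model.call("start+8", "main")   # thunk entry, copies main
model.call("main+12", "foo")    # thunk entry, copies foo
model.call("main+30", "foo")    # thunk entry, foo already copied: patch only
model.call("main+12", "foo")    # patched call site: no thunk involved
print(model.copies, model.thunk_entries)
```

In this model each procedure is copied at most once and each call site enters a thunk at most once, which is one way to see why the copying overhead can stay small relative to the total instruction count.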
Simulation Methodology

                 UNIX/RISC        Win32/x86
Cache Size       8K               8K
Associativity    Direct-Mapped    2-Way
Simulation       ATOM             Etch
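To make the two cache configurations concrete, here is a small instruction-cache model in Python. It is an illustrative sketch, not the ATOM or Etch tooling: the 32-byte line size and LRU replacement are assumptions (the slide gives only cache size and associativity), and the function name and address trace are made up.

```python
def miss_rate(addresses, cache_bytes=8192, line_bytes=32, ways=1):
    """Return the miss rate of a set-associative cache with LRU replacement.

    ways=1 models a direct-mapped cache, ways=2 a 2-way set-associative one.
    line_bytes=32 is an assumed value; the slide does not state a line size.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [[] for _ in range(num_sets)]  # per set: tags, least recent first
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index, tag = line % num_sets, line // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)              # hit: refreshed by the append below
        else:
            misses += 1
            if len(tags) == ways:
                tags.pop(0)               # evict the least recently used tag
        tags.append(tag)
    return misses / len(addresses)


# Two hot instruction addresses exactly one cache size (8K) apart,
# fetched alternately: a worst case for a direct-mapped cache.
trace = [0, 8192] * 50
print(miss_rate(trace, ways=1))
print(miss_rate(trace, ways=2))
```

In the direct-mapped configuration the two addresses keep evicting each other, while the 2-way cache holds both lines after the first two misses. Moving one of the two procedures, as a layout change would, removes the conflict in the direct-mapped case as well.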
Results: The AO heuristic is effective. The overhead of JITCL is negligible. JITCL improves procedure layout without requiring profile information. JITCL reduces program memory requirements.
Results: The AO Heuristic (chart: improvement in I-cache miss rate). Conclusion: Effectiveness of the heuristic is comparable to P&H.
Overhead of JITCL: Copy overhead (instruction overhead and cache overhead). Cache consistency. Disk overhead: comparable to demand-loaded text; not evaluated.
Results: Overhead (chart: overhead instructions, %). Conclusion: JITCL overhead is less than 0.1% in all cases.
Results: Performance (chart: saved cycles per instruction). Conclusion: Overall performance is comparable to P&H.
JITCL for Win32 Applications: Windows applications are composed of multiple executable modules. When transitions between modules are frequent, intra-module code layout is less effective. With JITCL, inter-module code layout is possible and beneficial.
Win32 Cache Miss Rates (chart). Conclusion: Careful layout did not help Win32 applications.
Text Segment Size (chart: text size in megabytes). Conclusion: JITCL typically reduces text size by 50%.
JITCL vs. PBO: JITCL provides an alternative to feedback-based procedure layout. Many important optimizations still require profile information: instruction scheduling, register allocation, and other intra-procedural optimizations. Don't expect profile-based optimization to go away!
Conclusions: Just-In-Time code layout achieves comparable benefit to profile-based code layout without the need for profiles. The AO heuristic is effective. The overhead of procedure copying is low. Benefit in I-Cache is comparable to Pettis and Hansen layout. JITCL can reduce working set size.
The Morph Project. For more information: http://www.eecs.harvard.edu/morph/