Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap
Paper by: Chris Lattner and Vikram Adve, University of Illinois at Urbana-Champaign
Best Paper Award at PLDI 2005
Presented by: Jeff Da Silva, CARG - Aug 2nd 2005
Motivation
• Computer architecture and compiler research has focused primarily on analyzing & optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
  • e.g. caches, prefetching, loop transformations, etc.
• Why?
  • Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays.
  • The layout of heap-allocated data structures and their traversal patterns are difficult to predict statically (and generally it's not worth the effort).
Intuition
• Improving a dynamic data structure's spatial locality through program analysis (such as shape analysis) is probably too difficult, and might not be the best approach.
• What if you could somehow influence the layout dynamically, so that the data structure is allocated intelligently and possibly appears like a dense array?
• A new approach: develop a technique that operates at the "macroscopic" level
  • i.e. at the level of the entire data structure rather than individual pointers or objects
What is the problem?
• What the program creates: distinct data structures (List 1 nodes, List 2 nodes, Tree nodes) [diagram]
• What the compiler sees: the same List 1, List 2, and Tree nodes interleaved arbitrarily in the heap [diagram]
• What we want the program to create and the compiler to see: each data structure segregated into its own region of the heap [diagram]
Their Approach: Segregate the Heap
• Step #1: Memory Usage Analysis
  • Build context-sensitive points-to graphs for the program
  • We use a fast unification-based algorithm
• Step #2: Automatic Pool Allocation
  • Segregate memory based on points-to graph nodes
  • Find lifetime bounds for memory with escape analysis
  • Preserve the points-to-graph-to-pool mapping
• Step #3: Follow-on pool-specific optimizations
  • Use segregation and the points-to graph for later optimizations
Why Segregate Data Structures?
• Primary goal: better compiler information & control
  • Compiler knows where each data structure lives in memory
  • Compiler knows the order of data in memory (in some cases)
  • Compiler knows type info for heap objects (from points-to info)
  • Compiler knows which pools point to which other pools
• Secondary goal: better performance
  • Smaller working sets
  • Improved spatial locality
    • Especially if allocation order matches traversal order
  • Sometimes converts irregular strides to regular strides
Contributions of this Paper
• First "region inference" technique for C/C++:
  • Previous work required type-safe programs: ML, Java
  • Previous work focused on memory management
• Region inference driven by pointer analysis:
  • Enables handling non-type-safe programs
  • Simplifies handling imperative programs
  • Simplifies further pool+pointer transformations
• New pool-based optimizations:
  • Exploit per-pool and pool-specific properties
• Evaluation of impact on the memory hierarchy:
  • We show that pool allocation reduces working sets
Outline
• Introduction & Motivation
• Automatic Pool Allocation Transformation
• Pool Allocation-Based Optimizations
• Pool Allocation & Optimization Performance Impact
• Conclusion
Example

struct list { list *Next; int *Data; };

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
Example

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes between lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated from separate pools of memory.
Pool Alloc Runtime Library Interface

void poolcreate(Pool *PD, uint Size, uint Align);
  Initializes a pool descriptor (obtains one or more pages of memory using malloc).
void pooldestroy(Pool *PD);
  Releases pool memory and destroys the pool descriptor.
void *poolalloc(Pool *PD, uint numBytes);
void poolfree(Pool *PD, void *ptr);
void *poolrealloc(Pool *PD, void *ptr, uint numBytes);

The interface also includes: poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..).
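To make the interface concrete, here is a minimal sketch of what a pool descriptor might hold, assuming the slab-based design described above; the field names and layout are illustrative assumptions, not the paper's actual runtime:

/* Hypothetical pool descriptor -- field names are assumptions for illustration. */
typedef struct PoolSlab {
  struct PoolSlab *Next;   /* slabs obtained from malloc, linked per pool */
  /* ... object data follows the slab header ... */
} PoolSlab;

typedef struct Pool {
  PoolSlab *Slabs;         /* list of slabs owned by this pool */
  void *FreeList;          /* recycled objects of size Size (used by poolfree) */
  unsigned Size;           /* declared object size for this pool */
  unsigned Align;          /* required alignment, e.g. 4 or 8 bytes */
} Pool;

Under this design, pooldestroy(PD) can release every slab on PD->Slabs in one pass, which is why destroying a pool frees all of its objects at once.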
Algorithm Steps
• Generate a DS Graph (points-to graph) for each function
• Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function
• Add pool descriptor arguments for every DS node that escapes a function
• Replace calls to malloc and free with calls to poolalloc and poolfree
• Apply further refinements and optimizations
Points-To Graph – DS Graph
• Builds a points-to graph for each function in Bottom-Up (BU) order
• Context-sensitive naming of heap objects
  • More advanced than the traditional "allocation call-site" naming
• A unification-based approach
  • Allows for a fast and scalable analysis
  • Ensures every pointer points to one unique node
• Field sensitive
  • Added accuracy
• Also used to compute escape info
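As a rough sketch of the unification step (illustrative only, not the paper's DSA implementation), merging the targets of two pointers can be done with a union-find structure:

/* Union-find over points-to graph nodes: a sketch of unification,
   not the actual DSA code. */
typedef struct Node {
  struct Node *Rep;        /* representative; points to itself at creation */
} Node;

Node *find(Node *N) {      /* find the representative, with path compression */
  if (N->Rep != N)
    N->Rep = find(N->Rep);
  return N->Rep;
}

void unify(Node *A, Node *B) {   /* an assignment "p = q" merges their targets */
  A = find(A);
  B = find(B);
  if (A != B)
    B->Rep = A;            /* afterwards, both pointers target one unique node */
}

Unification is what keeps the analysis fast and scalable: each new constraint costs a near-constant-time merge rather than an iterative dataflow solve.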
Example – DS Graph

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}
Example – DS Graph

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}

[DS graph: *R1 and *R2 point to two distinct list nodes, P1 and P2]
Example – DS Graph

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes between lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

[DS graph: A and B point to two distinct, non-escaping list nodes, P1 and P2]
Example – Transformation

list *createnode(Pool *PD, int *Data) {
  list *New = poolalloc(PD, sizeof(list));
  New->Data = Data;
  return New;
}
Example – Transformation

void splitclone(Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(PD1, L->Data);
    splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(PD2, L->Data);
    splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  }
}

[The two DS graph nodes, P1 and P2, map to pools PD1 and PD2]
Example – Transformation

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
More Algorithm Details
• Indirect function call handling:
  • Partition functions into equivalence classes:
    • If F1 and F2 share a common call site, they go in the same class
  • Merge points-to graphs for each equivalence class
  • Apply the previous transformation unchanged
• Global variables pointing to memory nodes:
  • Use a global pool variable rather than passing pool descriptors around through function arguments
More Algorithm Details
• poolcreate / pooldestroy placement:
  • Move calls earlier/later by analyzing the pool's lifetime
  • Reduces memory usage
  • Enables poolfree elimination
• poolfree elimination:
  • Eliminate unnecessary poolfree calls
    • e.g. when no allocations occur between the poolfree and the pool's pooldestroy
  • Behaves like static garbage collection
Example – poolcreate/pooldestroy placement

Before: the transformed processlist above, with both pooldestroy calls at the very end of the function.

After: each pool is destroyed as soon as the last use of its memory is past:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}
Example – poolfree Elimination

Each poolfree in the code above is followed only by its pool's pooldestroy (no intervening allocations), so the poolfrees can be eliminated:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

The now-empty traversal loops are dead code, leaving:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
Outline
• Introduction & Motivation
• Automatic Pool Allocation Transformation
• Pool Allocation-Based Optimizations
• Pool Allocation & Optimization Performance Impact
• Conclusion
PAOpts (1/4) and (2/4)
• Selective Pool Allocation
  • Don't pool-allocate when it is not profitable
  • Avoids creating and destroying a pool descriptor (minor), and avoids significant wasted space when the objects are much smaller than the smallest internal page (see the sketch below)
• PoolFree Elimination
  • Remove explicit deallocations that are not needed
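A hedged sketch of the kind of profitability test selective pool allocation implies; the threshold and names here are hypothetical, not taken from the paper:

/* Hypothetical profitability check: skip pools whose total contents are
   much smaller than the smallest internal page, where slab space would
   mostly be wasted. The 1/4 threshold is purely illustrative. */
int worth_pooling(unsigned est_total_bytes, unsigned min_page_bytes) {
  return est_total_bytes >= min_page_bytes / 4;
}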
Looking closely: Anatomy of a heap
• Fully general malloc-compatible allocator:
  • Supports malloc/free/realloc/memalign, etc.
  • Standard malloc overheads: object header, alignment padding
  • Allocates slabs of memory with exponential growth
  • By default, all returned pointers are 8-byte aligned
• In memory, things look like this (16-byte allocs):
  [4-byte object header | 16-byte user data | 4-byte padding for user-data alignment | 4-byte object header | 16-byte user data | ...]
  Each 16-byte object occupies 24 bytes, so consecutive objects straddle 32-byte cache lines.
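The per-object arithmetic for the 16-byte example can be checked directly (a small standalone computation, not tooling from the paper):

#include <stdio.h>

int main(void) {
  unsigned header = 4, user = 16, align = 8;
  unsigned raw = header + user;                          /* 20 bytes */
  unsigned footprint = (raw + align - 1) & ~(align - 1); /* round up to 24 */
  /* Prints: footprint: 24 bytes, overhead: 50% */
  printf("footprint: %u bytes, overhead: %u%%\n",
         footprint, (footprint - user) * 100 / user);
  return 0;
}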
PAOpts (3/4): Bump Pointer Optimization
• If a pool has no poolfrees:
  • Eliminate the per-object header
  • Eliminate free-list overhead (faster object allocation)
  • Eliminates 4 bytes of inter-object padding
    • Packs objects more densely in the cache
• Interacts with poolfree elimination (PAOpt 2/4)!
  • If poolfree elimination deletes all frees, the bump pointer optimization can apply
• In memory: [16-byte user data | 16-byte user data | ...] -- two 16-byte objects now fit exactly in one 32-byte cache line
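A minimal sketch of bump-pointer allocation under these conditions (single slab shown, with assumed names; a real pool would grow by adding slabs):

/* Bump-pointer allocation sketch: no per-object header, no free list. */
typedef struct BumpPool {
  char *Cur;    /* next free byte in the current slab */
  char *End;    /* one past the end of the current slab */
} BumpPool;

void *bumpalloc(BumpPool *P, unsigned Bytes) {
  if (P->Cur + Bytes > P->End)
    return 0;               /* a real pool would grab a new, larger slab here */
  void *Result = P->Cur;    /* objects pack back to back, densely in cache */
  P->Cur += Bytes;
  return Result;
}

Allocation is just a compare and an add, which is why pools with no poolfree calls get both faster allocation and denser packing.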
PAOpts (4/4): Alignment Analysis
• malloc must return 8-byte aligned memory:
  • It has no idea what types will be stored in the memory
  • Some machines bus-error on unaligned accesses; others suffer performance penalties
• Type-safe pools infer a type for the pool:
  • Use 4-byte alignment for pools we know don't need 8-byte alignment
  • Reduces inter-object padding
• In memory: [4-byte object header | 16-byte user data | 4-byte object header | 16-byte user data | ...] -- with 4-byte alignment the inter-object padding disappears
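The padding saved by the weaker alignment requirement is easy to quantify with a small hypothetical helper (not from the paper):

/* Inter-object padding needed to keep successive objects aligned. */
unsigned padding_for(unsigned objsize, unsigned align) {
  return (align - objsize % align) % align;
}
/* For the 20-byte header+data objects above:
   padding_for(20, 8) == 4 wasted bytes per object,
   padding_for(20, 4) == 0 -- alignment analysis recovers that space. */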
Outline
• Introduction & Motivation
• Automatic Pool Allocation Transformation
• Pool Allocation-Based Optimizations
• Pool Allocation & Optimization Performance Impact
• Conclusion
Implementation & Infrastructure
• Link-time transformation using the LLVM Compiler Infrastructure
• Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
• Evaluated on an AMD Athlon MP 2100+
  • 64KB L1, 256KB L2
Simple Pool Allocation Statistics (Table 1)
• Programs from the SPEC CINT2K, Ptrdist, FreeBench & Olden suites, plus unbundled programs
• DSA is able to infer that most static pools are type-homogeneous
Compile Time (Table 3)
• Compilation overhead is less than 3%
Pool Allocation Speedup
• Most programs are 0% to 20% faster with pool allocation alone; two are 10x faster, and one is almost 2x faster
• Several programs are unaffected by pool allocation (see paper)
• Sizable speedups across many pointer-intensive programs
• Some programs (ft, chomp) are an order of magnitude faster
Pool Optimization Speedup (FullPA)
Baseline 1.0 = run time with pool allocation (Figure 9, with a different baseline)
• Most programs are 5-15% faster with the optimizations than with pool allocation alone; one is 44% faster, another 29% faster
• The pool optimizations help some programs that pool allocation by itself doesn't
• The effect of the pool optimizations can be additive with the pool allocation effect
• The optimizations help all of these programs: despite being very simple, they make a big impact
Cache/TLB Miss Reduction (Figure 10)
• Miss rates measured with perfctr on an AMD Athlon 2100+
• Sources of the reduction:
  • Defragmented heap
  • Reduced inter-object padding
  • Segregating the heap!
Pool Optimization Statistics (Table 2)
Optimization Contribution (Figure 11)
Pool Allocation Conclusions
• Segregate the heap based on the points-to graph
  • Improved memory hierarchy performance
  • Gives the compiler some control over layout
  • Gives the compiler information about locality
• Optimize pools based on per-pool properties
  • Very simple (but useful) optimizations proposed here
  • The optimizations could be applied to other systems
Pool Allocation: Example

Original code:

list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num-1) : 0;
  New->Data = Num;
  return New;
}

int twoLists() {
  list *X = makeList(10);
  list *Y = makeList(100);
  GL = Y;
  processList(X);
  processList(Y);
  freeList(X);
  freeList(Y);
}

After pool allocation: list X's pool (P1) does not escape twoLists, so it is created and destroyed locally; list Y escapes through the global GL, so its pool descriptor (P2) is passed in as an argument. Calls to free become calls to poolfree: explicit deallocation is retained.

list *makeList(int Num, Pool* P) {
  list *New = poolalloc(P, sizeof(list));
  New->Next = Num ? makeList(Num-1, P) : 0;
  New->Data = Num;
  return New;
}

int twoLists(Pool* P2) {
  Pool P1;
  poolinit(&P1);
  list *X = makeList(10, &P1);
  list *Y = makeList(100, P2);
  GL = Y;
  processList(X);
  processList(Y);
  freeList(X, &P1);
  freeList(Y, P2);
  pooldestroy(&P1);
}