Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap
Paper by: Chris Lattner and Vikram Adve, University of Illinois at Urbana-Champaign
Best Paper Award at PLDI 2005
Presented by: Jeff Da Silva, CARG - Aug 2nd 2005
Motivation
• Computer architecture and compiler research has focused primarily on analyzing & optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
  • e.g. caches, prefetching, loop transformations, etc.
• Why?
  • Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays.
  • The layout of heap-allocated data structures and their traversal patterns are difficult to predict statically (and generally it's not worth the effort).
Intuition
• Improving a dynamic data structure's spatial locality through program analysis (such as shape analysis) is probably too difficult, and might not be the best approach.
• What if you could somehow influence the layout dynamically, so that the data structure is allocated intelligently and possibly appears like a dense array?
• A new approach: develop a technique that operates at the "macroscopic" level
  • i.e. at the level of the entire data structure rather than individual pointers or objects
What is the problem?
• What the program creates: distinct data structures (List 1 nodes, List 2 nodes, Tree nodes) [diagram]
• What the compiler sees: the same List 1, List 2, and Tree nodes interleaved arbitrarily in the heap [diagram]
• What we want the program to create and the compiler to see: each data structure segregated into its own region of the heap [diagram]
Their Approach: Segregate the Heap
• Step #1: Memory Usage Analysis
  • Build context-sensitive points-to graphs for the program
  • We use a fast unification-based algorithm
• Step #2: Automatic Pool Allocation
  • Segregate memory based on points-to graph nodes
  • Find lifetime bounds for memory with escape analysis
  • Preserve the points-to-graph-to-pool mapping
• Step #3: Follow-on pool-specific optimizations
  • Use segregation and the points-to graph for later optimizations
Why Segregate Data Structures?
• Primary goal: better compiler information & control
  • Compiler knows where each data structure lives in memory
  • Compiler knows the order of data in memory (in some cases)
  • Compiler knows type info for heap objects (from points-to info)
  • Compiler knows which pools point to which other pools
• Secondary goal: better performance
  • Smaller working sets
  • Improved spatial locality
    • Especially if allocation order matches traversal order
  • Sometimes converts irregular strides to regular strides
Contributions of this Paper
• First "region inference" technique for C/C++:
  • Previous work required type-safe programs: ML, Java
  • Previous work focused on memory management
• Region inference driven by pointer analysis:
  • Enables handling non-type-safe programs
  • Simplifies handling imperative programs
  • Simplifies further pool+pointer transformations
• New pool-based optimizations:
  • Exploit per-pool and pool-specific properties
• Evaluation of impact on the memory hierarchy:
  • We show that pool allocation reduces working sets
Outline
• Introduction & Motivation
• Automatic Pool Allocation Transformation
• Pool Allocation-Based Optimizations
• Pool Allocation & Optimization Performance Impact
• Conclusion
Example

struct list { list *Next; int *Data; };

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
Example

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes between lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated from separate pools of memory.
Pool Alloc Runtime Library Interface

void poolcreate(Pool *PD, uint Size, uint Align);
  Initializes a pool descriptor (obtains one or more pages of memory using malloc).
void pooldestroy(Pool *PD);
  Releases pool memory and destroys the pool descriptor.
void *poolalloc(Pool *PD, uint numBytes);
void poolfree(Pool *PD, void *ptr);
void *poolrealloc(Pool *PD, void *ptr, uint numBytes);

The interface also includes: poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..).
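To make the interface concrete, here is a minimal sketch of what a pool descriptor might hold, assuming the slab-based design described above; the field names and layout are illustrative assumptions, not the paper's actual runtime:

/* Hypothetical pool descriptor -- field names are assumptions for illustration. */
typedef struct PoolSlab {
  struct PoolSlab *Next;   /* slabs obtained from malloc, linked per pool */
  /* ... object data follows the slab header ... */
} PoolSlab;

typedef struct Pool {
  PoolSlab *Slabs;         /* list of slabs owned by this pool */
  void *FreeList;          /* recycled objects of size Size (used by poolfree) */
  unsigned Size;           /* declared object size for this pool */
  unsigned Align;          /* required alignment, e.g. 4 or 8 bytes */
} Pool;

Under this design, pooldestroy(PD) can release every slab on PD->Slabs in one pass, which is why destroying a pool frees all of its objects at once.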
Algorithm Steps
• Generate a DS Graph (points-to graph) for each function
• Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function
• Add pool descriptor arguments for every DS node that escapes a function
• Replace calls to malloc and free with calls to poolalloc and poolfree
• Apply further refinements and optimizations
Points-To Graph – DS Graph
• Builds a points-to graph for each function in Bottom-Up (BU) order
• Context-sensitive naming of heap objects
  • More advanced than the traditional "allocation call-site" naming
• A unification-based approach
  • Allows for a fast and scalable analysis
  • Ensures every pointer points to one unique node
• Field sensitive
  • Added accuracy
• Also used to compute escape info
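As a rough sketch of the unification step (illustrative only, not the paper's DSA implementation), merging the targets of two pointers can be done with a union-find structure:

/* Union-find over points-to graph nodes: a sketch of unification,
   not the actual DSA code. */
typedef struct Node {
  struct Node *Rep;        /* representative; points to itself at creation */
} Node;

Node *find(Node *N) {      /* find the representative, with path compression */
  if (N->Rep != N)
    N->Rep = find(N->Rep);
  return N->Rep;
}

void unify(Node *A, Node *B) {   /* an assignment "p = q" merges their targets */
  A = find(A);
  B = find(B);
  if (A != B)
    B->Rep = A;            /* afterwards, both pointers target one unique node */
}

Unification is what keeps the analysis fast and scalable: each new constraint costs a near-constant-time merge rather than an iterative dataflow solve.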
Example – DS Graph

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}
Example – DS Graph

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}

[DS graph: *R1 and *R2 point to two distinct list nodes, P1 and P2]
Example – DS Graph

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes between lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

[DS graph: A and B point to two distinct, non-escaping list nodes, P1 and P2]
Example – Transformation

list *createnode(Pool *PD, int *Data) {
  list *New = poolalloc(PD, sizeof(list));
  New->Data = Data;
  return New;
}
Example – Transformation

void splitclone(Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(PD1, L->Data);
    splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(PD2, L->Data);
    splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  }
}

[The two DS graph nodes, P1 and P2, map to pools PD1 and PD2]
Example – Transformation

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
More Algorithm Details
• Indirect function call handling:
  • Partition functions into equivalence classes:
    • If F1 and F2 share a common call site, they go in the same class
  • Merge points-to graphs for each equivalence class
  • Apply the previous transformation unchanged
• Global variables pointing to memory nodes:
  • Use a global pool variable rather than passing pool descriptors around through function arguments
More Algorithm Details
• poolcreate / pooldestroy placement:
  • Move calls earlier/later by analyzing the pool's lifetime
  • Reduces memory usage
  • Enables poolfree elimination
• poolfree elimination:
  • Eliminate unnecessary poolfree calls
    • e.g. when no allocations occur between the poolfree and the pool's pooldestroy
  • Behaves like static garbage collection
Example – poolcreate/pooldestroy placement

Before: the transformed processlist above, with both pooldestroy calls at the very end of the function.

After: each pool is destroyed as soon as the last use of its memory is past:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}
Example – poolfree Elimination

Each poolfree in the code above is followed only by its pool's pooldestroy (no intervening allocations), so the poolfrees can be eliminated:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

The now-empty traversal loops are dead code, leaving:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);   // Process first list
  processPortion(B);   // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
Outline
• Introduction & Motivation
• Automatic Pool Allocation Transformation
• Pool Allocation-Based Optimizations
• Pool Allocation & Optimization Performance Impact
• Conclusion
PAOpts (1/4) and (2/4)
• Selective Pool Allocation
  • Don't pool-allocate when it is not profitable
  • Avoids creating and destroying a pool descriptor (minor), and avoids significant wasted space when the objects are much smaller than the smallest internal page (see the sketch below)
• PoolFree Elimination
  • Remove explicit deallocations that are not needed
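A hedged sketch of the kind of profitability test selective pool allocation implies; the threshold and names here are hypothetical, not taken from the paper:

/* Hypothetical profitability check: skip pools whose total contents are
   much smaller than the smallest internal page, where slab space would
   mostly be wasted. The 1/4 threshold is purely illustrative. */
int worth_pooling(unsigned est_total_bytes, unsigned min_page_bytes) {
  return est_total_bytes >= min_page_bytes / 4;
}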
Looking closely: Anatomy of a heap
• Fully general malloc-compatible allocator:
  • Supports malloc/free/realloc/memalign, etc.
  • Standard malloc overheads: object header, alignment padding
  • Allocates slabs of memory with exponential growth
  • By default, all returned pointers are 8-byte aligned
• In memory, things look like this (16-byte allocs):
  [4-byte object header | 16-byte user data | 4-byte padding for user-data alignment | 4-byte object header | 16-byte user data | ...]
  Each 16-byte object occupies 24 bytes, so consecutive objects straddle 32-byte cache lines.
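The per-object arithmetic for the 16-byte example can be checked directly (a small standalone computation, not tooling from the paper):

#include <stdio.h>

int main(void) {
  unsigned header = 4, user = 16, align = 8;
  unsigned raw = header + user;                          /* 20 bytes */
  unsigned footprint = (raw + align - 1) & ~(align - 1); /* round up to 24 */
  /* Prints: footprint: 24 bytes, overhead: 50% */
  printf("footprint: %u bytes, overhead: %u%%\n",
         footprint, (footprint - user) * 100 / user);
  return 0;
}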
PAOpts (3/4): Bump Pointer Optimization
• If a pool has no poolfrees:
  • Eliminate the per-object header
  • Eliminate free-list overhead (faster object allocation)
  • Eliminates 4 bytes of inter-object padding
    • Packs objects more densely in the cache
• Interacts with poolfree elimination (PAOpt 2/4)!
  • If poolfree elimination deletes all frees, the bump pointer optimization can apply
• In memory: [16-byte user data | 16-byte user data | ...] -- two 16-byte objects now fit exactly in one 32-byte cache line
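A minimal sketch of bump-pointer allocation under these conditions (single slab shown, with assumed names; a real pool would grow by adding slabs):

/* Bump-pointer allocation sketch: no per-object header, no free list. */
typedef struct BumpPool {
  char *Cur;    /* next free byte in the current slab */
  char *End;    /* one past the end of the current slab */
} BumpPool;

void *bumpalloc(BumpPool *P, unsigned Bytes) {
  if (P->Cur + Bytes > P->End)
    return 0;               /* a real pool would grab a new, larger slab here */
  void *Result = P->Cur;    /* objects pack back to back, densely in cache */
  P->Cur += Bytes;
  return Result;
}

Allocation is just a compare and an add, which is why pools with no poolfree calls get both faster allocation and denser packing.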
PAOpts (4/4): Alignment Analysis
• malloc must return 8-byte aligned memory:
  • It has no idea what types will be stored in the memory
  • Some machines bus-error on unaligned accesses; others suffer performance penalties
• Type-safe pools infer a type for the pool:
  • Use 4-byte alignment for pools we know don't need 8-byte alignment
  • Reduces inter-object padding
• In memory: [4-byte object header | 16-byte user data | 4-byte object header | 16-byte user data | ...] -- with 4-byte alignment the inter-object padding disappears
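The padding saved by the weaker alignment requirement is easy to quantify with a small hypothetical helper (not from the paper):

/* Inter-object padding needed to keep successive objects aligned. */
unsigned padding_for(unsigned objsize, unsigned align) {
  return (align - objsize % align) % align;
}
/* For the 20-byte header+data objects above:
   padding_for(20, 8) == 4 wasted bytes per object,
   padding_for(20, 4) == 0 -- alignment analysis recovers that space. */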
Outline
• Introduction & Motivation
• Automatic Pool Allocation Transformation
• Pool Allocation-Based Optimizations
• Pool Allocation & Optimization Performance Impact
• Conclusion
Implementation & Infrastructure
• Link-time transformation using the LLVM Compiler Infrastructure
• Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
• Evaluated on an AMD Athlon MP 2100+
  • 64KB L1, 256KB L2
Simple Pool Allocation Statistics (Table 1)
• Programs from the SPEC CINT2K, Ptrdist, FreeBench & Olden suites, plus unbundled programs
• DSA is able to infer that most static pools are type-homogeneous
Compile Time (Table 3)
• Compilation overhead is less than 3%
Pool Allocation Speedup
• Most programs are 0% to 20% faster with pool allocation alone; two are 10x faster, and one is almost 2x faster
• Several programs are unaffected by pool allocation (see paper)
• Sizable speedups across many pointer-intensive programs
• Some programs (ft, chomp) are an order of magnitude faster
Pool Optimization Speedup (FullPA)
Baseline 1.0 = run time with pool allocation (Figure 9, with a different baseline)
• Most programs are 5-15% faster with the optimizations than with pool allocation alone; one is 44% faster, another 29% faster
• The pool optimizations help some programs that pool allocation by itself doesn't
• The effect of the pool optimizations can be additive with the pool allocation effect
• The optimizations help all of these programs: despite being very simple, they make a big impact
Cache/TLB Miss Reduction (Figure 10)
• Miss rates measured with perfctr on an AMD Athlon 2100+
• Sources of the reduction:
  • Defragmented heap
  • Reduced inter-object padding
  • Segregating the heap!
Pool Optimization Statistics (Table 2)
Optimization Contribution (Figure 11)
Pool Allocation Conclusions
• Segregate the heap based on the points-to graph
  • Improved memory hierarchy performance
  • Gives the compiler some control over layout
  • Gives the compiler information about locality
• Optimize pools based on per-pool properties
  • Very simple (but useful) optimizations proposed here
  • The optimizations could be applied to other systems
Pool Allocation: Example

Original code:

list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num-1) : 0;
  New->Data = Num;
  return New;
}

int twoLists() {
  list *X = makeList(10);
  list *Y = makeList(100);
  GL = Y;
  processList(X);
  processList(Y);
  freeList(X);
  freeList(Y);
}

After pool allocation: list X's pool (P1) does not escape twoLists, so it is created and destroyed locally; list Y escapes through the global GL, so its pool descriptor (P2) is passed in as an argument. Calls to free become calls to poolfree: explicit deallocation is retained.

list *makeList(int Num, Pool* P) {
  list *New = poolalloc(P, sizeof(list));
  New->Next = Num ? makeList(Num-1, P) : 0;
  New->Data = Num;
  return New;
}

int twoLists(Pool* P2) {
  Pool P1;
  poolinit(&P1);
  list *X = makeList(10, &P1);
  list *Y = makeList(100, P2);
  GL = Y;
  processList(X);
  processList(Y);
  freeList(X, &P1);
  freeList(Y, P2);
  pooldestroy(&P1);
}