
Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap


Presentation Transcript


  1. Automatic Pool Allocation:Improving Performance by Controlling Data Structure Layout in the Heap Paper by: Chris Lattner and Vikram Adve University of Illinois at Urbana-Champaign Best Paper Award at PLDI 2005 Presented by: Jeff Da Silva CARG - Aug 2nd 2005

  2. Motivation • Computer architecture and compiler research has focused primarily on analyzing & optimizing memory access patterns for dense arrays rather than for pointer-based data structures. • e.g. caches, prefetching, loop transformations, etc. • Why? • Compilers have precise knowledge of the runtime layout and traversal patterns of arrays. • The layout and traversal patterns of heap-allocated data structures are difficult to predict statically (and generally not worth the effort).

  3. Intuition • Improving a dynamic data structure’s spatial locality through program analysis (like shape analysis) is probably too difficult and is likely not the best approach. • What if you could instead influence the layout dynamically, so that the data structure is allocated intelligently and possibly even appears like a dense array? • A new approach: develop a technique that operates at the “macroscopic” level • i.e. at the level of the entire data structure rather than individual pointers or objects

  4. What is the problem? What the program creates: [figure: three distinct data structures: List 1 Nodes, List 2 Nodes, Tree Nodes]

  5. What is the problem? What the compiler sees: [figure: the same List 1, List 2, and Tree nodes, indistinguishable in one heap]

  6. What is the problem? What we want the program to create and the compiler to see: [figure: List 1 Nodes, List 2 Nodes, and Tree Nodes segregated into separate regions]

  7. Their Approach: Segregate the Heap • Step #1: Memory Usage Analysis • Build context-sensitive points-to graphs for the program • We use a fast unification-based algorithm • Step #2: Automatic Pool Allocation • Segregate memory based on points-to graph nodes • Find lifetime bounds for memory with escape analysis • Preserve the points-to graph-to-pool mapping • Step #3: Follow-on pool-specific optimizations • Use segregation and the points-to graph for later optimizations

  8. Why Segregate Data Structures? • Primary Goal: Better compiler information & control • Compiler knows where each data structure lives in memory • Compiler knows order of data in memory (in some cases) • Compiler knows type info for heap objects (from points-to info) • Compiler knows which pools point to which other pools • Second Goal: Better performance • Smaller working sets • Improved spatial locality • Especially if allocation order matches traversal order • Sometimes convert irregular strides to regular strides

  9. Contributions of this Paper • First “region inference” technique for C/C++: • Previous work required type-safe programs: ML, Java • Previous work focused on memory management • Region inference driven by pointer analysis: • Enables handling non-type-safe programs • Simplifies handling imperative programs • Simplifies further pool+ptr transformations • New pool-based optimizations: • Exploit per-pool and pool-specific properties • Evaluation of impact on memory hierarchy: • We show that pool allocation reduces working sets

  10. Outline • Introduction & Motivation • Automatic Pool Allocation Transformation • Pool Allocation-Based Optimizations • Pool Allocation & Optimization Performance Impact • Conclusion

  11. Example

  struct list { list *Next; int *Data; };

  list *createnode(int *Data) {
    list *New = malloc(sizeof(list));
    New->Data = Data;
    return New;
  }

  void splitclone(list *L, list **R1, list **R2) {
    if (L == 0) { *R1 = *R2 = 0; return; }
    if (some_predicate(L->Data)) {
      *R1 = createnode(L->Data);
      splitclone(L->Next, &(*R1)->Next, R2);
    } else {
      *R2 = createnode(L->Data);
      splitclone(L->Next, R1, &(*R2)->Next);
    }
  }

  12. Example

  void processlist(list *L) {
    list *A, *B, *tmp;
    // Clone L, splitting nodes into lists A and B.
    splitclone(L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; free(A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; free(B); B = tmp; }
  }

  Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated from separate pools of memory.

  13. Pool Alloc Runtime Library Interface

  void poolcreate(Pool *PD, uint Size, uint Align);
    Initialize a pool descriptor. (Obtains one or more pages of memory using malloc.)
  void pooldestroy(Pool *PD);
    Release the pool's memory and destroy the pool descriptor.
  void *poolalloc(Pool *PD, uint numBytes);
  void poolfree(Pool *PD, void *ptr);
  void *poolrealloc(Pool *PD, void *ptr, uint numBytes);

  The interface also includes: poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..).
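  To make the interface concrete, here is a minimal usage sketch. The Pool descriptor type and the list struct are taken from the slides; the surrounding function is illustrative only:

    void pool_example(void) {
      Pool PD;                           // pool descriptor lives on the stack
      poolcreate(&PD, sizeof(list), 8);  // element size and 8-byte alignment

      list *N = (list *)poolalloc(&PD, sizeof(list));  // replaces malloc()
      N->Next = 0;

      poolfree(&PD, N);   // explicit deallocation is retained (optional)
      pooldestroy(&PD);   // releases all of the pool's memory at once
    }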

  14. Algorithm Steps • Generate a DS Graph (points-to graph) for each function • Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function • Add pool descriptor arguments for every DS node that escapes a function • Replace calls to malloc and free with calls to poolalloc and poolfree • Apply further refinements and optimizations

  15. Points-To Graph – DS Graph • Builds a points-to graph for each function in bottom-up [BU] order • Context-sensitive naming of heap objects • More precise than the traditional “allocation call-site” naming • A unification-based approach (see the sketch below) • Allows a fast and scalable analysis • Ensures every pointer points to one unique node • Field sensitive • Adds accuracy • Also used to compute escape info
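  The slides do not show the unification step itself, so the following is only an illustrative sketch of Steensgaard-style unification (not the paper's actual DSA implementation); Node, find, and unify are hypothetical names:

    // Each points-to graph node is a union-find element with at most
    // one outgoing points-to edge; merging nodes merges their targets.
    typedef struct Node {
      struct Node *parent;   // union-find parent; a root represents the node
      struct Node *pointee;  // the single outgoing points-to edge (or 0)
    } Node;

    Node *find(Node *n) {                      // path-compressing find
      if (n->parent != n) n->parent = find(n->parent);
      return n->parent;
    }

    void unify(Node *a, Node *b) {
      a = find(a); b = find(b);
      if (a == b) return;
      b->parent = a;                           // merge b into a
      // Unifying two nodes also unifies what they point to: this is
      // what guarantees each pointer ends up with one unique target.
      if (a->pointee && b->pointee) unify(a->pointee, b->pointee);
      else if (!a->pointee) a->pointee = b->pointee;
    }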

  16. Example – DS Graph

  list *createnode(int *Data) {
    list *New = malloc(sizeof(list));
    New->Data = Data;
    return New;
  }

  17. Example – DS Graph

  [figure: DS graph with heap nodes P1 and P2]

  void splitclone(list *L, list **R1, list **R2) {
    if (L == 0) { *R1 = *R2 = 0; return; }
    if (some_predicate(L->Data)) {
      *R1 = createnode(L->Data);
      splitclone(L->Next, &(*R1)->Next, R2);
    } else {
      *R2 = createnode(L->Data);
      splitclone(L->Next, R1, &(*R2)->Next);
    }
  }

  18. Example – DS Graph

  [figure: DS graph with heap nodes P1 and P2]

  void processlist(list *L) {
    list *A, *B, *tmp;
    // Clone L, splitting nodes into lists A and B.
    splitclone(L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; free(A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; free(B); B = tmp; }
  }

  19. Example – Transformation

  list *createnode(Pool *PD, int *Data) {
    list *New = poolalloc(PD, sizeof(list));
    New->Data = Data;
    return New;
  }

  20. Example – Transformation

  [figure: DS graph with heap nodes P1 and P2]

  void splitclone(Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {
    if (L == 0) { *R1 = *R2 = 0; return; }
    if (some_predicate(L->Data)) {
      *R1 = createnode(PD1, L->Data);
      splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
    } else {
      *R2 = createnode(PD2, L->Data);
      splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
    }
  }

  21. Example – Transformation

  [figure: DS graph with heap nodes P1 and P2]

  void processlist(list *L) {
    list *A, *B, *tmp;
    Pool PD1, PD2;
    poolcreate(&PD1, sizeof(list), 8);
    poolcreate(&PD2, sizeof(list), 8);
    splitclone(&PD1, &PD2, L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
    pooldestroy(&PD1);
    pooldestroy(&PD2);
  }

  22. More Algorithm Details • Indirect function call handling (see the sketch below): • Partition functions into equivalence classes: if F1 and F2 have a common call site → same class • Merge points-to graphs for each equivalence class • Apply the previous transformation unchanged • Global variables pointing to memory nodes: • Use a global pool variable rather than passing the pool around through function arguments
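  A hypothetical illustration of the equivalence-class rule; f, g, and caller are made-up names. Because f and g can both be invoked from the same indirect call site, they fall into one class and must receive identical transformed signatures:

    void f(list *L);
    void g(list *L);

    void caller(int cond, list *L) {
      void (*fp)(list *) = cond ? f : g;
      fp(L);   // common call site => f and g land in the same class
    }

    // After the transformation (sketch), both take the same pool argument:
    //   void f(Pool *PD, list *L);
    //   void g(Pool *PD, list *L);
    //   void caller(Pool *PD, int cond, list *L) { ... fp(PD, L); }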

  23. More Algorithm Details • poolcreate / pooldestroy placement: • Move the calls earlier/later by analyzing the pool’s lifetime • Reduces memory usage • Enables poolfree elimination • poolfree elimination: • Eliminate poolfree calls that are unnecessary because no allocations occur between the poolfree and the pooldestroy • Behaves like static garbage collection

  24. Example – poolcreate/pooldestroy placement

  Before:
    void processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1, sizeof(list), 8);
      poolcreate(&PD2, sizeof(list), 8);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);  // Process first list
      processPortion(B);  // Process second list
      // free A list
      while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
      // free B list
      while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
      pooldestroy(&PD1);
      pooldestroy(&PD2);
    }

  After (pooldestroy(&PD1) moves up to the last use of pool PD1):
    void processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1, sizeof(list), 8);
      poolcreate(&PD2, sizeof(list), 8);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);  // Process first list
      processPortion(B);  // Process second list
      // free A list
      while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
      pooldestroy(&PD1);
      // free B list
      while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
      pooldestroy(&PD2);
    }

  25. Example – poolfree Elimination

  Step 1 (after pooldestroy placement):
    void processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1, sizeof(list), 8);
      poolcreate(&PD2, sizeof(list), 8);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);  // Process first list
      processPortion(B);  // Process second list
      // free A list
      while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
      pooldestroy(&PD1);
      // free B list
      while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
      pooldestroy(&PD2);
    }

  Step 2 (each poolfree is followed only by its pooldestroy, so it is deleted):
      ...
      // free A list
      while (A) { tmp = A->Next; A = tmp; }
      pooldestroy(&PD1);
      // free B list
      while (B) { tmp = B->Next; B = tmp; }
      pooldestroy(&PD2);
    }

  Step 3 (the now-empty traversal loops are removed):
    void processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1, sizeof(list), 8);
      poolcreate(&PD2, sizeof(list), 8);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);  // Process first list
      processPortion(B);  // Process second list
      pooldestroy(&PD1);
      pooldestroy(&PD2);
    }

  26. Outline • Introduction & Motivation • Automatic Pool Allocation Transformation • Pool Allocation-Based Optimizations • Pool Allocation & Optimization Performance Impact • Conclusion

  27. PAOpts (1/4) and (2/4) • Selective Pool Allocation • Don’t pool allocate when it is not profitable: this avoids creating and destroying a pool descriptor (a minor saving) and, more importantly, avoids wasting significant space when a pool’s objects occupy much less than the smallest internal page. • PoolFree Elimination • Remove explicit deallocations that are not needed

  28. Looking closely: Anatomy of a heap • Fully general malloc-compatible allocator: • Supports malloc/free/realloc/memalign etc. • Standard malloc overheads: object header, alignment padding • Allocates slabs of memory with exponential growth • By default, all returned pointers are 8-byte aligned • In memory, 16-byte allocations look like the layout below: [figure: one 32-byte cache line holding a 4-byte object header, 4-byte padding for user-data alignment, and 16 bytes of user data]
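  The layout the slide depicts can be written out as a struct; the field names here are illustrative, not the allocator's actual metadata:

    // A 16-byte malloc request actually occupies 24 bytes, so one
    // 32-byte cache line cannot hold two such objects.
    struct AllocatedObject {
      unsigned Header;    // 4-byte object header (allocator bookkeeping)
      unsigned Padding;   // 4 bytes so the user data is 8-byte aligned
      char UserData[16];  // the 16 bytes the program asked for
    };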

  29. PAOpts (3/4): Bump Pointer Optzn • If a pool has no poolfrees: • Eliminate the per-object header • Eliminate free-list overhead (faster object allocation; see the sketch below) • Eliminates 4 bytes of inter-object padding • Packs objects more densely in the cache • Interacts with poolfree elimination (PAOpt 2/4)! • If poolfree elimination deletes all frees, the bump pointer can apply [figure: four 16-byte user objects densely packed, two per 32-byte cache line]
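  A minimal bump-pointer sketch, assuming a pool that is never poolfree'd; BumpPool and bump_alloc are illustrative names, not the paper's runtime:

    typedef struct BumpPool {
      char *Cur;   // next free byte in the current slab
      char *End;   // one past the end of the current slab
    } BumpPool;

    void *bump_alloc(BumpPool *P, unsigned NumBytes) {
      if (P->Cur + NumBytes > P->End)
        return 0;          // real code would grow the pool with a new slab
      void *Result = P->Cur;
      P->Cur += NumBytes;  // no header, no free list: just a pointer bump
      return Result;       // everything is reclaimed at once by pooldestroy
    }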

  30. PAOpts (4/4): Alignment Analysis • malloc must return 8-byte aligned memory: • It has no idea what types will be stored in the memory • Some machines take a bus error on unaligned accesses; others suffer performance penalties • Type-safe pools infer a type for the pool: • Use 4-byte alignment for pools we know don’t need 8-byte alignment (see the arithmetic below) • Reduces inter-object padding [figure: 4-byte object header followed by densely packed 16-byte user objects across 32-byte cache lines]
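  A worked example of the padding this saves; padded_size is an illustrative helper, and the 12-byte object size is a made-up example:

    // Round an object size up to the pool's alignment.
    unsigned padded_size(unsigned ObjSize, unsigned Align) {
      return (ObjSize + Align - 1) & ~(Align - 1);
    }
    // padded_size(12, 8) == 16, but padded_size(12, 4) == 12:
    // a 64-byte region then packs 5 objects instead of 4.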

  31. Outline • Introduction & Motivation • Automatic Pool Allocation Transformation • Pool Allocation-Based Optimizations • Pool Allocation & Optimization Performance Impact • Conclusion

  32. Implementation & Infrastructure • Link-time transformation using the LLVM Compiler Infrastructure • Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3 • Evaluated on an AMD Athlon MP 2100+ • 64KB L1, 256KB L2

  33. Simple Pool Allocation Statistics (Table 1) • Programs from the SPEC CINT2K, Ptrdist, FreeBench & Olden suites, plus unbundled programs

  34. Simple Pool Allocation Statistics (Table 1) • DSA is able to infer that most static pools are type-homogeneous

  35. Compile Time (Table 3) • Compilation overhead is less than 3%

  36. Pool Allocation Speedup • Most programs are 0% to 20% faster with pool allocation alone • Several programs are unaffected by pool allocation (see paper) • Sizable speedups across many pointer-intensive programs • Some programs (ft, chomp) are an order of magnitude faster

  37. Pool Allocation Speedup • Two programs are 10x faster, and one is almost 2x faster • Several programs are unaffected by pool allocation (see paper) • Sizable speedups across many pointer-intensive programs • Some programs (ft, chomp) are an order of magnitude faster

  38. Pool Optimization Speedup (FullPA) • Figure 9, with a different baseline: 1.0 = run time with pool allocation • Most programs are 5-15% faster with the optimizations than with pool allocation alone • The optimizations help all of these programs: despite being very simple, they make a big impact

  39. Pool Optimization Speedup (FullPA) • Figure 9, with a different baseline: 1.0 = run time with pool allocation • One program is 44% faster, another is 29% faster • The optimizations help all of these programs: despite being very simple, they make a big impact

  40. Pool Optimization Speedup (FullPA) • Figure 9, with a different baseline: 1.0 = run time with pool allocation • The pool optimizations help some programs that pool allocation by itself doesn’t • The optimizations help all of these programs: despite being very simple, they make a big impact

  41. Pool Optimization Speedup (FullPA) • Figure 9, with a different baseline: 1.0 = run time with pool allocation • The effect of the pool optimizations can be additive with the pool allocation effect • The optimizations help all of these programs: despite being very simple, they make a big impact

  42. Cache/TLB Miss Reduction (Figure 10) • Miss rates measured with perfctr on an AMD Athlon 2100+ • Sources of the reduction: • Defragmented heap • Reduced inter-object padding • Segregating the heap!

  43. Pool Optimization Statistics (Table 2)

  44. Optimization Contribution (Figure 11)

  45. Pool Allocation Conclusions • Segregate heap based on points-to graph • Improved Memory Hierarchy Performance • Give compiler some control over layout • Give compiler information about locality • Optimize pools based on per-pool properties • Very simple (but useful) optimizations proposed here • Optimizations could be applied to other systems

  46. The End

  47. Backup Slides

  48. Table 4

  49. Table 5

  50. Pool Allocation: Example

  Original code:
    list *makeList(int Num) {
      list *New = malloc(sizeof(list));
      New->Next = Num ? makeList(Num-1) : 0;
      New->Data = Num;
      return New;
    }
    int twoLists() {
      list *X = makeList(10);
      list *Y = makeList(100);
      GL = Y;               // Y escapes into the global GL
      processList(X);
      processList(Y);
      freeList(X);
      freeList(Y);
    }

  After pool allocation (pools P1 and P2):
    list *makeList(int Num, Pool *P) {
      list *New = poolalloc(P, sizeof(list));
      New->Next = Num ? makeList(Num-1, P) : 0;
      New->Data = Num;
      return New;
    }
    int twoLists(Pool *P2) {   // Y escapes, so its pool P2 is passed in
      Pool P1;                 // X does not escape: local pool
      poolinit(&P1);
      list *X = makeList(10, &P1);
      list *Y = makeList(100, P2);
      GL = Y;
      processList(X);
      processList(Y);
      freeList(X, &P1);
      freeList(Y, P2);
      pooldestroy(&P1);
    }

  Change calls to free into calls to poolfree → retain explicit deallocation.
