Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard: A Scalable Memory Allocator for Multithreaded Applications. Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson Presented by Ivan Jibaja (Some slides adapted from Emery Berger’s presentation). Outline. Motivation Problems in allocator design False sharing Fragmentation

Presentation Transcript


  1. Hoard: A Scalable Memory Allocator for Multithreaded Applications Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson Presented by Ivan Jibaja (Some slides adapted from Emery Berger’s presentation)

  2. Outline • Motivation • Problems in allocator design • False sharing • Fragmentation • Existing approaches • Hoard design • Experimental evaluation

  3. Motivation • Parallel multithreaded programs prevalent • Web servers, search engines, DB managers etc. • Run on CMP/SMP for high performance • Memory allocation is a bottleneck • Prevents scaling with number of processors

  4. Desired allocator attributes on a multiprocessor system • Speed • Competitive with uniprocessor allocators on 1 CPU • Scalability • Performance linear with the number of processors • Fragmentation (= max allocated / max in use) • High fragmentation → poor data locality → paging • False sharing avoidance
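
The fragmentation ratio on the slide above (max allocated / max in use) can be illustrated with a small calculation. This is a minimal sketch: the function name `fragmentation` and the sample numbers are made up for illustration and do not come from the presentation.

```python
def fragmentation(trace):
    """Return max allocated / max in use over a list of
    (allocated_bytes, in_use_bytes) samples taken during a run."""
    max_allocated = max(allocated for allocated, _ in trace)
    max_in_use = max(in_use for _, in_use in trace)
    return max_allocated / max_in_use

# Hypothetical samples: the allocator holds up to 16 KB from the OS
# while the program never has more than 12 KB live at once.
samples = [(8192, 4096), (16384, 12288), (16384, 2048)]
frag = fragmentation(samples)
print(f"fragmentation = {frag:.2f}")
```

A ratio of 1.0 would mean the allocator never held more memory than the program's peak live data.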

  5. The problem of false sharing • Program causes false sharing • Allocate a number of objects in a cache line, pass the objects to different threads • Allocators cause false sharing! • Actively: malloc satisfies different threads' requests from the same cache line • Passively: free allows a future malloc to produce false sharing [Diagram: processor 1 runs x1 = malloc(s) and processor 2 runs x2 = malloc(s); both objects fall in a single cache line, which thrashes between the two processors]
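
Whether two objects can falsely share comes down to whether their addresses map to the same cache line. The sketch below only does that address arithmetic. The 64-byte line size and the example addresses are assumptions chosen for illustration, and the helper names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the slides do not state one

def cache_line(addr, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that contains byte address addr."""
    return addr // line_bytes

def may_falsely_share(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if two distinct objects start in the same cache line."""
    return cache_line(addr_a, line_bytes) == cache_line(addr_b, line_bytes)

# Two 8-byte objects carved back to back from one heap (hypothetical addresses).
x1 = 0x1000
x2 = x1 + 8
# An object handed out from a different 8 KB region.
x3 = x1 + 8192

print(may_falsely_share(x1, x2))  # adjacent objects share a line
print(may_falsely_share(x1, x3))  # objects from separate regions do not
```

If x1 and x2 are given to different threads, every write by one thread invalidates the line in the other processor's cache, even though the threads never touch the same object.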

  6. The problem of fragmentation • Blowup: • Increase in memory consumption when allocator reclaims memory freed by program, but fails to use it for future requests • Mainly a problem of concurrent allocators • Unbounded (worst case) or bounded (O(P))

  7. Example: Pure Private Heaps Allocator • Pure private heaps: • one heap per processor • malloc gets memory from the processor's heap or the system • free puts memory on the processor's heap • Avoids heap contention • Examples: STL, Cilk [Diagram: processors 1 and 2 interleave x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x4 = malloc(s), x3 = malloc(s); legend: allocated by heap 1; free, on heap 2]

  8. How to Break Pure Private Heaps: Fragmentation • Pure private heaps: • memory consumption can grow without bound! • Producer-consumer: • processor 1 allocates • processor 2 frees • Memory always unavailable to producer [Diagram: processor 1 calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 calls free(x1), free(x2), free(x3)]
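
The producer-consumer failure described above can be reproduced with a toy model. This is a sketch under simplifying assumptions: blocks are opaque integer IDs, every block has the same size, and the class name `PurePrivateHeaps` is invented for this example.

```python
class PurePrivateHeaps:
    """Toy model: one free list per processor; free() keeps the block
    on the heap of whichever processor called free()."""

    def __init__(self, n_procs):
        self.free_lists = [[] for _ in range(n_procs)]
        self.from_os = 0  # number of blocks ever requested from the system

    def malloc(self, proc):
        if self.free_lists[proc]:
            return self.free_lists[proc].pop()
        self.from_os += 1          # local heap empty: go to the system
        return self.from_os        # block id

    def free(self, proc, block):
        self.free_lists[proc].append(block)

heaps = PurePrivateHeaps(n_procs=2)
for _ in range(1000):
    x = heaps.malloc(0)  # processor 1 (index 0) is the producer
    heaps.free(1, x)     # processor 2 (index 1) is the consumer

print(heaps.from_os, len(heaps.free_lists[1]))
```

The program never has more than one live block, yet the model requests a new block from the system on every iteration, because the freed memory accumulates on the consumer's heap where the producer cannot reach it.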

  9. Example II: Private Heaps with Ownership • free puts memory back on the originating processor's heap • Avoids unbounded memory consumption • Examples: ptmalloc, LKmalloc [Diagram: processor 1 calls x1 = malloc(s) and x2 = malloc(s); processor 2 calls free(x1) and free(x2)]

  10. How to Break Private Heaps with Ownership: Fragmentation • Memory consumption can blow up by a factor of P • Round-robin producer-consumer: • processor i allocates • processor i+1 frees • Program requires 1 (K) blocks, allocator gets 3 (P*K) blocks [Diagram: processors 1, 2 and 3 in turn: x1 = malloc(s), free(x1); x2 = malloc(s), free(x2); x3 = malloc(s), free(x3)]
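
The P-fold blowup above can be checked with the same kind of toy model, now returning each freed block to the heap that allocated it. This is a sketch under assumptions: equal-sized blocks, blocks represented as (owner, id) tuples, and an invented class name `OwnershipHeaps`.

```python
class OwnershipHeaps:
    """Toy model: one free list per processor; free() returns the block
    to the heap of the processor that originally allocated it."""

    def __init__(self, n_procs):
        self.free_lists = [[] for _ in range(n_procs)]
        self.from_os = 0  # number of blocks ever requested from the system

    def malloc(self, proc):
        if self.free_lists[proc]:
            return self.free_lists[proc].pop()
        self.from_os += 1
        return (proc, self.from_os)  # (owner, block id)

    def free(self, block):
        owner, _ = block
        self.free_lists[owner].append(block)

P, K = 3, 4  # 3 processors, program needs K = 4 live blocks at a time
heaps = OwnershipHeaps(P)
for _ in range(10):                  # repeat the round-robin several times
    for i in range(P):
        blocks = [heaps.malloc(i) for _ in range(K)]  # processor i allocates
        for b in blocks:
            heaps.free(b)            # processor i+1 frees; block returns to heap i

print(heaps.from_os)
```

The model ends up holding P*K blocks although only K are ever live at once. Unlike the pure private heaps case, the total stops growing after the first round, which is the bounded O(P) blowup the slide refers to.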

  11. Existing approaches

  12. Uniprocessor Allocators on Multiprocessors • Fragmentation: Excellent • Very low for most programs [Wilson & Johnstone] • Speed & Scalability: Poor • Heap contention • A single lock protects the heap • Can exacerbate false sharing • Different processors can share cache lines

  13. Existing Multiprocessor Allocators • Speed: • One concurrent heap (e.g., concurrent B-tree): • O(log (#size-classes)) cost per memory operation • too many locks/atomic updates → fast allocators use multiple heaps • Scalability: • Allocator-induced false sharing • Other bottlenecks (e.g. nextHeap global in Ptmalloc) • Fragmentation: • P-fold increase or even unbounded

  14. Hoard as the solution

  15. Hoard Overview • P per-processor heaps & 1 global heap • Each thread accesses only its local heap & the global heap • Manages memory in page-sized superblocks of same-sized objects (LIFO free-list) • Avoids false sharing by not carving up cache lines • Avoids heap contention: local heaps allocate & free small blocks from their superblocks • Avoids blowup by moving superblocks to the global heap when the fraction of free memory exceeds some threshold
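
A superblock of same-sized objects with a LIFO free list can be sketched as follows. This is an illustration only, not Hoard's implementation: addresses are plain integers, the 8 KB superblock and 8-byte block sizes are borrowed from the evaluation slides, and the class and method names are hypothetical.

```python
class Superblock:
    """Toy superblock: a contiguous region carved into equal-sized blocks,
    with freed blocks reused in last-in, first-out order."""

    def __init__(self, base, size=8192, block=8):
        self.block = block
        # Stack of free block addresses; the lowest address is on top.
        self.free_list = [base + off for off in range(size - block, -1, -block)]

    def alloc(self):
        """Pop the most recently freed (or next unused) block, or None if full."""
        return self.free_list.pop() if self.free_list else None

    def free(self, addr):
        """Push the block back; it will be the next one handed out."""
        self.free_list.append(addr)

sb = Superblock(base=0x10000)
a = sb.alloc()
b = sb.alloc()
sb.free(a)
c = sb.alloc()   # LIFO: the block just freed is reused first
print(hex(a), hex(b), hex(c))
```

Reusing the most recently freed block first tends to hand back memory that is still warm in the cache.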

  16. Superblock management • Emptiness threshold: (ui ≥ (1-f)*ai) ∨ (ui ≥ ai - K*S), with f = ¼ and K = 0 • Multiple heaps → avoid actively induced false sharing • Block coalescing → avoid passively induced false sharing • Superblocks transferred are usually empty and transfer is infrequent
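
The emptiness threshold above can be written as a predicate. In this sketch, ui is read as the bytes in use in heap i and ai as the bytes held by heap i, following the pseudo-code slide. The function names are invented, and the default S = 8192 is taken from the evaluation slide.

```python
from fractions import Fraction

def invariant_holds(u, a, S=8192, f=Fraction(1, 4), K=0):
    """Emptiness invariant for one heap: (u >= (1-f)*a) or (u >= a - K*S)."""
    return u >= (1 - f) * a or u >= a - K * S

def should_release_superblock(u, a, S=8192, f=Fraction(1, 4), K=0):
    """A heap gives a superblock to the global heap when the invariant fails."""
    return not invariant_holds(u, a, S, f, K)

a = 4 * 8192                  # heap holds four superblocks (32768 bytes)
print(should_release_superblock(24576, a))  # exactly 3/4 in use: keep
print(should_release_superblock(24575, a))  # just under 3/4 in use: release
```

With K = 0 the second clause only holds when the heap is completely full, so the f term is what decides in practice. A larger K would let a heap keep up to K superblocks' worth of free memory before releasing anything.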

  17. Hoard pseudo-code
      malloc(sz)
      • If sz > S/2, allocate the superblock from the OS and return it.
      • i ← hash(current thread)
      • Lock heap i
      • Scan heap i's list of superblocks from most full to least full (for the size class of sz)
      • If there is no superblock with free space {
      •   Check heap 0 (global) for a superblock
      •   If there is none {
      •     Allocate S bytes as superblock s & set owner to heap i
      •   } Else {
      •     Transfer the superblock s to heap i
      •     u0 ← u0 - s.u; ui ← ui + s.u
      •     a0 ← a0 - S; ai ← ai + S
      •   }
      • }
      • ui ← ui + sz; s.u ← s.u + sz
      • Unlock heap i
      • Return a block from the superblock
      free(ptr)
      • If the block is "large"
      •   Free superblock to OS and return
      • Find the superblock s this block comes from
      • Lock s
      • Lock heap i, the superblock's owner
      • Deallocate the block from the superblock
      • ui ← ui - block size
      • s.u ← s.u - block size
      • If (i = 0) unlock heap i, superblock s and return
      • If (ui < ai - K*S) and (ui < (1-f)*ai) {
      •   Transfer a mostly-empty superblock s1 to heap 0 (global)
      •   u0 ← u0 + s1.u; ui ← ui - s1.u
      •   a0 ← a0 + S; ai ← ai - S
      • }
      • Unlock heap i and superblock s
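
The bookkeeping in the pseudo-code above can be traced with a small single-threaded model. This is a sketch under heavy assumptions, not Hoard itself: it omits locks, large objects and thread hashing, uses a single size class, and lets the "pointer" returned by malloc be the superblock object. All class and attribute names are hypothetical.

```python
class Superblock:
    def __init__(self, owner):
        self.owner = owner   # index of the owning heap
        self.u = 0           # bytes in use in this superblock

class Heap:
    def __init__(self):
        self.u = 0           # bytes in use (ui)
        self.a = 0           # bytes held (ai)
        self.superblocks = []

class HoardModel:
    def __init__(self, n_procs, S=8192, f=0.25, K=0, block=8):
        self.S, self.f, self.K, self.block = S, f, K, block
        self.heaps = [Heap() for _ in range(n_procs + 1)]  # heap 0 is global
        self.from_os = 0     # bytes ever requested from the system

    def malloc(self, i):
        """Allocate one block from heap i (1..P); returns its superblock."""
        heap = self.heaps[i]
        with_room = [s for s in heap.superblocks if s.u + self.block <= self.S]
        if with_room:
            s = max(with_room, key=lambda sb: sb.u)   # fullest with free space
        else:
            g = self.heaps[0]
            if g.superblocks:                         # reuse from global heap
                s = g.superblocks.pop()
                g.u -= s.u; g.a -= self.S
                heap.u += s.u
            else:                                     # fresh superblock
                s = Superblock(i)
                self.from_os += self.S
            s.owner = i
            heap.a += self.S
            heap.superblocks.append(s)
        heap.u += self.block
        s.u += self.block
        return s

    def free(self, s):
        """Free one block; it always returns to the superblock's owner heap."""
        i = s.owner
        heap = self.heaps[i]
        heap.u -= self.block
        s.u -= self.block
        if i == 0:
            return
        if heap.u < heap.a - self.K * self.S and heap.u < (1 - self.f) * heap.a:
            s1 = min(heap.superblocks, key=lambda sb: sb.u)  # emptiest
            heap.superblocks.remove(s1)
            s1.owner = 0
            g = self.heaps[0]
            g.superblocks.append(s1)
            g.u += s1.u; g.a += self.S
            heap.u -= s1.u; heap.a -= self.S

# Producer-consumer pattern: thread on heap 1 allocates, another thread frees.
model = HoardModel(n_procs=2)
for _ in range(10_000):
    ptr = model.malloc(1)
    model.free(ptr)
print(model.from_os)
```

In this model the producer-consumer loop that broke pure private heaps requests only a single superblock from the system, because the freed memory flows back through the global heap.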

  18. Heap contention • Per-processor heap contention • 1 allocator thread / multiple threads free • Inherently unscalable • Pairs of producer/consumer threads • malloc/free calls serialized • At most 2X slowdown (undesirable but scalable) • Empirically only a small fraction of memory is freed by another thread → contention expected to be low

  19. Heap contention (2) • Global heap contention • Measure # of global heap (GH) lock acquisitions as an upper bound • Growing phase: • Each thread makes at most k/(f*S/s) acquisitions for k mallocs • Shrinking phase: • Pathological case where the program frees (1-f) of each superblock and then frees every block in a superblock one at a time • Empirically: no excessive shrinking and gradual growth of memory usage → low overall contention
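
The growing-phase bound k/(f*S/s) is easy to evaluate for concrete numbers. The values below (S = 8192, f = 1/4, 8-byte objects, 100,000 mallocs) are borrowed from the evaluation slides purely as an example, and the function name is made up.

```python
def max_global_heap_acquisitions(k, S=8192, f=0.25, s=8):
    """Upper bound on global heap lock acquisitions for k mallocs of
    s-byte objects, per the growing-phase formula k / (f * S / s)."""
    blocks_per_acquisition = f * S / s   # at least this many mallocs per trip
    return k / blocks_per_acquisition

bound = max_global_heap_acquisitions(100_000)
print(bound)
```

With these numbers a thread visits the global heap at most once per 256 mallocs, roughly 391 times in 100,000 allocations.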

  20. Experimental Evaluation • Dedicated 14-processor Sun Enterprise • 400 MHz UltraSPARC • 2 GB RAM, 4 MB L2 cache • Solaris 7 • Superblock size = 8K, f = ¼ • Comparison between • Hoard • Ptmalloc (GNU libc, multiple heaps & ownership) • Mtmalloc (Solaris multithreaded allocator) • Solaris (default system allocator)

  21. Benchmarks

  22. Speed • Size classes need to be handled more cleverly

  23. Scalability - threadtest • t threads allocate/deallocate 100,000/t 8-byte objects • 278% faster than Ptmalloc on 14 CPUs

  24. Scalability - Larson • "Bleeding" typical in server applications • Mainly stays within empty fraction during execution • 18X faster than next best allocator on 14 CPUs

  25. Scalability - BEMengine • Few times below empty fraction → low synchronization

  26. False sharing behavior • Active-false: each thread allocates a small object, writes to it a few times, frees it • Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false • Illustrates the effects of contention in the coherence mechanism

  27. Fragmentation results • Large number of size classes remain live for the duration of the program and are scattered across blocks • Within 20% of Lea's allocator

  28. Hoard Conclusions • Speed: Excellent • As fast as a uniprocessor allocator on one processor • amortized O(1) cost • 1 lock for malloc, 2 for free • Scalability: Excellent • Scales linearly with the number of processors • Avoids false sharing • Fragmentation: Very good • Worst-case is provably close to ideal • Actual observed fragmentation is low

  29. Discussion Points • If we had to re-evaluate Hoard today, which benchmarks would we use? • Are there any changes needed to make it work with languages like Java?
