Hoard: A Scalable Memory Allocator for Multithreaded Applications Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson Presented by Dimitris Prountzos (Some slides adapted from Emery Berger’s presentation)
Outline • Motivation • Problems in allocator design • False sharing • Fragmentation • Existing approaches • Hoard design • Experimental evaluation
Motivation • Parallel multithreaded programs prevalent • Web servers, search engines, DB managers etc. • Run on CMP/SMP for high performance • Some of them embarrassingly parallel • Memory allocation is a bottleneck • Prevents scaling with number of processors
Desired allocator attributes on a multiprocessor system • Speed • Competitive with uniprocessor allocators on 1 CPU • Scalability • Performance linear with the number of processors • Fragmentation (= max allocated / max in use) • High fragmentation ⇒ poor data locality ⇒ paging • False sharing avoidance
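As a small worked example of the fragmentation ratio defined on this slide, here is a minimal Python sketch. The byte counts are invented for illustration and are not measurements from the paper; the function name is hypothetical.

```python
def fragmentation(allocated_trace, in_use_trace):
    """Peak memory held by the allocator divided by peak memory in use by the program."""
    return max(allocated_trace) / max(in_use_trace)

# Hypothetical byte counts sampled over one program run (not measured data).
allocated = [4096, 8192, 12288, 12288]
in_use = [3000, 7000, 10240, 6000]

ratio = fragmentation(allocated, in_use)
print(ratio)
```

A ratio of 1.0 would mean the allocator never held more memory than the program needed at its peak.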
The problem of false sharing • Programs cause false sharing: allocate a number of objects in one cache line, then pass the objects to different threads • Allocators cause false sharing too! • Actively: malloc satisfies requests from different threads out of the same cache line • Passively: free allows a future malloc to produce false sharing • [Figure: a single cache line; processor 1 runs x1 = malloc(s), processor 2 runs x2 = malloc(s); both then thrash…]
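To make the cache-line argument concrete, the sketch below checks whether two object addresses fall in the same line. The 64-byte line size, the base address, and the helper names are assumptions made for illustration; only the 8 KB superblock size comes from the evaluation slide.

```python
CACHE_LINE = 64      # bytes; an assumed, typical line size
SUPERBLOCK = 8192    # bytes; the superblock size quoted in the evaluation

def cache_line_of(addr):
    """Index of the cache line that holds byte address addr."""
    return addr // CACHE_LINE

def falsely_shared(addr_a, addr_b):
    """True when two distinct objects live in the same cache line."""
    return addr_a != addr_b and cache_line_of(addr_a) == cache_line_of(addr_b)

base = 0x10000  # hypothetical heap base address

# One shared heap: consecutive 8-byte requests from two threads are adjacent.
x1_shared, x2_shared = base, base + 8

# Per-heap superblocks: each thread's request is carved from its own superblock.
x1_split, x2_split = base, base + SUPERBLOCK

print(falsely_shared(x1_shared, x2_shared))
print(falsely_shared(x1_split, x2_split))
```

The first pair models actively induced false sharing. The second models the effect of giving each heap its own superblocks, so that blocks handed to different threads never come from the same line.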
The problem of fragmentation • Blowup: • Increase in memory consumption when allocator reclaims memory freed by program, but fails to use it for future requests • Mainly a problem of concurrent allocators • Unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator • Pure private heaps: • One heap per processor • malloc gets memory from the processor's heap or the system • free puts memory on the processor's heap • Avoids heap contention • Examples: STL, Cilk • [Figure: timeline for processor 1 and processor 2 with the calls x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x4 = malloc(s), x3 = malloc(s); legend: allocated by heap 1; free, on heap 2]
How to Break Pure Private Heaps: Fragmentation • Pure private heaps: memory consumption can grow without bound! • Producer-consumer: • processor 1 allocates • processor 2 frees • Freed memory lands on processor 2's heap, so it is never available to the producer • [Figure: processor 1 runs x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 runs free(x1), free(x2), free(x3)]
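The producer-consumer failure can be reproduced with a few lines of bookkeeping. This is a toy model that only counts blocks, not a real allocator; the function name and the block-counting simplification are assumptions.

```python
def pure_private_heaps(rounds):
    """Count blocks requested from the system under pure private heaps.

    Processor 1 only allocates and processor 2 only frees, one block per round,
    so at most one block is live at any time.
    """
    free_blocks = {1: 0, 2: 0}  # free blocks sitting on each processor's heap
    from_system = 0
    for _ in range(rounds):
        # Processor 1 mallocs: reuse a block from its own heap, else ask the system.
        if free_blocks[1] > 0:
            free_blocks[1] -= 1
        else:
            from_system += 1
        # Processor 2 frees: the block goes onto processor 2's heap, not processor 1's.
        free_blocks[2] += 1
    return from_system

blocks_requested = pure_private_heaps(1000)
print(blocks_requested)
```

The count grows with the number of rounds even though the program never needs more than one live block, which is the unbounded blowup the slide describes.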
Example II: Private Heaps with Ownership • free puts memory back on the originating processor's heap • Avoids unbounded memory consumption • Examples: ptmalloc, LKmalloc • [Figure: timeline for processor 1 and processor 2 with the calls x1 = malloc(s), free(x1), x2 = malloc(s), free(x2)]
How to Break Private Heaps with Ownership: Fragmentation • Memory consumption can blow up by a factor of P • Round-robin producer-consumer: processor i allocates, processor i+1 frees • Program requires 1 (K) blocks, allocator gets 3 (P*K) blocks • [Figure: timeline for processors 1, 2, and 3 with the calls x1 = malloc(s), free(x1), x2 = malloc(s), free(x2), x3 = malloc(s), free(x3)]
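The round-robin pattern can be modelled the same way. The sketch below is a block-counting toy, with a hypothetical function name, in which freed blocks always return to the heap that allocated them.

```python
def ownership_heaps(P, K, rounds):
    """Count blocks requested from the system under private heaps with ownership.

    In round r, processor r % P allocates K blocks and the next processor frees
    them; ownership sends the freed blocks back to the allocating heap.
    The program therefore never has more than K blocks live.
    """
    free_blocks = [0] * P
    from_system = 0
    for r in range(rounds):
        i = r % P
        # Processor i allocates K blocks, reusing whatever its own heap holds.
        reused = min(free_blocks[i], K)
        free_blocks[i] -= reused
        from_system += K - reused
        # Processor i+1 frees them; they return to heap i, the owner.
        free_blocks[i] += K
    return from_system

three_procs = ownership_heaps(P=3, K=1, rounds=30)
print(three_procs)
```

With P = 3 and K = 1 this reproduces the slide's numbers: the program needs one block but the allocator ends up holding three, and the total stops growing once every heap has its own K blocks.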
Uniprocessor Allocators on Multiprocessors • Fragmentation: Excellent • Very low for most programs [Wilson & Johnstone] • Speed & Scalability: Poor • Heap contention • A single lock protects the heap • Can exacerbate false sharing • Different processors can share cache lines
Existing Multiprocessor Allocators • Speed: • One concurrent heap (e.g., concurrent B-tree): • O(log (#size-classes)) cost per memory operation • Too many locks/atomic updates ⇒ fast allocators use multiple heaps • Scalability: • Allocator-induced false sharing • Other bottlenecks (e.g. nextHeap global in Ptmalloc) • Fragmentation: • P-fold increase or even unbounded
Hoard Overview • P per-processor heaps & 1 global heap • Each thread accesses only its local heap & global • Manages memory in page-sized superblocks of same-sized objects (LIFO free-list) • Avoids false sharing by not carving up cache lines • Avoids heap contention – local heaps allocate & free small blocks from their superblocks • Avoids blowup by • Moving superblocks to global heap when fraction of free memory exceeds some threshold
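Superblock management • Emptiness threshold: $(u_i \ge (1-f)\,a_i) \lor (u_i \ge a_i - K \cdot S)$, where $u_i$ is the memory in use in heap i, $a_i$ the memory held by heap i, S the superblock size and f the empty fraction • In the example: $f = 1/4$, $K = 0$ • Multiple heaps ⇒ avoid actively induced false sharing • Block coalescing ⇒ avoid passively induced false sharing • Superblocks transferred are usually empty and transfer is infrequent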
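The emptiness threshold is just a predicate over a heap's counters, which the following sketch spells out. The function name and the sample byte counts are assumptions; the defaults for f and K are the values shown on the slide.

```python
def invariant_holds(u, a, S, f=0.25, K=0):
    """Emptiness invariant for one per-processor heap.

    u: bytes in use, a: bytes held, S: superblock size,
    f: empty fraction, K: number of superblocks of slack allowed.
    """
    return u >= (1 - f) * a or u >= a - K * S

S = 8192
held = 4 * S  # a heap holding four superblocks

exactly_three_quarters = invariant_holds(u=24576, a=held, S=S)  # u == (1 - f) * a
slightly_emptier = invariant_holds(u=24000, a=held, S=S)        # below the threshold
one_superblock_slack = invariant_holds(u=100, a=S, S=S, K=1)    # K = 1 tolerates one superblock

print(exactly_three_quarters, slightly_emptier, one_superblock_slack)
```

When the predicate turns false after a free, the heap is "too empty" and becomes a candidate for handing a superblock to the global heap.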
Hoard pseudo-code

malloc(sz)
• If sz > S/2, allocate the superblock from the OS and return it.
• i ← hash(current thread)
• Lock heap i
• Scan heap i's list of superblocks from most full to least (for the size class of sz)
• If there is no superblock with free space {
    • Check heap 0 (global) for a superblock
    • If there is none {
        • Allocate S bytes as superblock s & set owner to heap i
    • } Else {
        • Transfer the superblock s to heap i
        • u_0 ← u_0 - s.u; u_i ← u_i + s.u
        • a_0 ← a_0 - S; a_i ← a_i + S
    • }
• }
• u_i ← u_i + sz; s.u ← s.u + sz
• Unlock heap i
• Return a block from the superblock

free(ptr)
• If the block is "large"
    • Free superblock to OS and return
• Find the superblock s this block comes from
• Lock s
• Lock heap i, the superblock's owner
• Deallocate the block from the superblock
• u_i ← u_i - block size
• s.u ← s.u - block size
• If (i = 0) unlock heap i and superblock s, and return
• If (u_i < a_i - K*S) and (u_i < (1-f)*a_i) {
    • Transfer a mostly-empty superblock s1 to heap 0 (global)
    • u_0 ← u_0 + s1.u; u_i ← u_i - s1.u
    • a_0 ← a_0 + S; a_i ← a_i - S
• }
• Unlock heap i and superblock s
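The bookkeeping in the pseudo-code can be exercised with a small single-threaded model. This is a sketch under stated assumptions, not Hoard's implementation: locks and the "large object" path are left out, there is a single size class, a modulo stands in for hash(current thread), the superblock object itself serves as the returned "pointer", and all class and method names are hypothetical.

```python
class Superblock:
    def __init__(self, owner):
        self.u = 0          # bytes in use inside this superblock
        self.owner = owner  # index of the heap that currently owns it

class Heap:
    def __init__(self):
        self.u = 0          # u_i: bytes in use
        self.a = 0          # a_i: bytes held
        self.superblocks = []

class HoardModel:
    def __init__(self, P, S=8192, f=0.25, K=0, block=8):
        self.P, self.S, self.f, self.K, self.block = P, S, f, K, block
        self.heaps = [Heap() for _ in range(P + 1)]  # heaps[0] is the global heap

    def _has_room(self, s):
        return s.u + self.block <= self.S

    def malloc(self, thread_id):
        i = 1 + thread_id % self.P  # stand-in for hash(current thread)
        heap, glob = self.heaps[i], self.heaps[0]
        roomy = [s for s in heap.superblocks if self._has_room(s)]
        if roomy:
            s = max(roomy, key=lambda sb: sb.u)  # most full superblock with space
        else:
            s = next((sb for sb in glob.superblocks if self._has_room(sb)), None)
            if s is None:
                s = Superblock(owner=i)  # fresh superblock from the "OS"
                heap.a += self.S         # implied by the pseudo-code
            else:
                glob.superblocks.remove(s)  # transfer from the global heap
                glob.u -= s.u
                heap.u += s.u
                glob.a -= self.S
                heap.a += self.S
                s.owner = i
            heap.superblocks.append(s)
        heap.u += self.block
        s.u += self.block
        return s

    def free(self, s):
        i = s.owner
        heap, glob = self.heaps[i], self.heaps[0]
        s.u -= self.block
        heap.u -= self.block
        if i == 0:
            return
        if heap.u < heap.a - self.K * self.S and heap.u < (1 - self.f) * heap.a:
            s1 = min(heap.superblocks, key=lambda sb: sb.u)  # emptiest superblock
            heap.superblocks.remove(s1)
            glob.superblocks.append(s1)
            s1.owner = 0
            glob.u += s1.u
            heap.u -= s1.u
            glob.a += self.S
            heap.a -= self.S

model = HoardModel(P=2)
ptrs = [model.malloc(thread_id=0) for _ in range(2048)]  # fills two superblocks
for p in ptrs:
    model.free(p)
more = [model.malloc(thread_id=1) for _ in range(1024)]  # a different heap allocates
held_total = sum(h.a for h in model.heaps)
print(held_total)
```

In this run the second thread's requests are served from a superblock that the first heap released to the global heap, so the total held memory stays at two superblocks instead of growing to three.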
Deriving bounds on blowup • Blowup := $O(A(t) / U(t))$ • $A(t) = A'(t)$ • Invariant: $(a_i(t) - K \cdot S \le u_i(t)) \lor ((1-f)\,a_i(t) \le u_i(t))$ • Result: $A(t) = O(U(t) + P)$ • $P \ll U(t)$ ⇒ blowup = $O(1)$ • Worst-case consumption is a constant-factor overhead that does not grow with the amount of memory required by the program
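The bound can be reconstructed from the bullets on this slide roughly as follows. This is a sketch of the argument rather than the paper's full proof. It takes $A(t) = A'(t)$ as given, and it reads $A'(t)$ as the peak memory held in the per-processor heaps and $U(t)$ as the peak memory in use by the program, which are assumptions about notation the slide does not spell out.

```latex
% Per-heap invariant, for each per-processor heap i = 1..P:
%   u_i >= a_i - K S   or   u_i >= (1 - f) a_i
% Either branch bounds a_i from above:
\[
a_i \;\le\; \max\!\Big(u_i + K S,\; \frac{u_i}{1-f}\Big)
    \;\le\; \frac{u_i}{1-f} + K S
\]
% Sum over the P per-processor heaps at the moment held memory peaks;
% the memory in use at that moment cannot exceed the peak U(t).
\[
A(t) = A'(t) \;\le\; \sum_{i=1}^{P} a_i
     \;\le\; \frac{1}{1-f}\sum_{i=1}^{P} u_i + P K S
     \;\le\; \frac{U(t)}{1-f} + P K S
     \;=\; O\big(U(t) + P\big)
\]
```

Because f, K and S are constants, the first term is a constant multiple of U(t) and the second grows only with the number of processors.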
Deriving bounds on contention (1) • Per-processor heap contention • One thread allocates, multiple threads free • Inherently unscalable • Pairs of producer/consumer threads • malloc/free calls serialized • At most 2X slowdown (undesirable but scalable) • Empirically only a small fraction of memory is freed by another thread ⇒ contention expected to be low
Deriving bounds on contention (2) • Global heap contention • Measure # of global heap (GH) lock acquisitions as an upper bound • Growing phase: • Each thread makes at most k/(f*S/s) acquisitions for k mallocs • Shrinking phase: • Pathological case where the program frees (1-f) of each superblock and then frees every block in a superblock one at a time • Empirically: no excessive shrinking and gradual growth of memory usage ⇒ low overall contention
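Plugging numbers into the growing-phase bound shows how rarely the global heap lock is taken. The sketch assumes s denotes the object size in bytes, and it borrows S = 8K and f = 1/4 from the evaluation setup and the 8-byte object size from threadtest; the function name is hypothetical.

```python
def max_global_acquisitions(k, S=8192, f=0.25, s=8):
    """Upper bound k / (f * S / s) on global-heap lock acquisitions for k mallocs."""
    blocks_per_acquisition = f * S / s  # objects that fit in the empty fraction of a superblock
    return k / blocks_per_acquisition

bound = max_global_acquisitions(k=100_000)
print(bound)
```

Under these assumptions a thread touches the global heap at most about once per 256 mallocs.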
Experimental Evaluation • Dedicated 14-processor Sun Enterprise • 400 MHz UltraSPARC • 2 GB RAM, 4 MB L2 cache • Solaris 7 • Superblock size = 8K, f = ¼ • Comparison between • Hoard • Ptmalloc (GNU libc, multiple heaps & ownership) • Mtmalloc (Solaris multithreaded allocator) • Solaris (default system allocator)
Speed Size classes need to be handled more cleverly
Scalability - threadtest • t threads allocate/deallocate 100,000/t 8-byte objects • 278% faster than Ptmalloc on 14 CPUs
Scalability – Larson • “Bleeding” typical in server applications • Mainly stays within empty fraction during execution • 18X faster than next best allocator on 14 cpus
Scalability - BEMengine • Drops below the empty fraction only a few times ⇒ low synchronization
False sharing behavior • Active-false: each thread allocates a small object, writes to it a few times, then frees it • Passive-false: allocate objects, hand them to threads that free them, then proceed as in Active-false • Both illustrate the effects of contention on the coherence mechanism
Fragmentation results • A large number of size classes remain live for the duration of the program and are scattered across blocks • Within 20% of Lea’s allocator
Hoard Conclusions • Speed: Excellent • As fast as a uniprocessor allocator on one processor • amortized O(1) cost • 1 lock for malloc, 2 for free • Scalability: Excellent • Scales linearly with the number of processors • Avoids false sharing • Fragmentation: Very good • Worst-case is provably close to ideal • Actual observed fragmentation is low
Discussion Points • If we had to re-evaluate Hoard today which benchmarks would we use? • Are there any changes needed to make it work with languages like Java?