Hoard: A Scalable Memory Allocator for Multithreaded Applications Berger*, McKinley+, Blumofe*, Wilson* (*UT Austin, +University of Massachusetts) ASPLOS 2000 Presented by Bogdan Simion
Dynamic Memory Allocation • Highly parallel applications are common • e.g., databases, web servers, assignment #2 • Dynamic memory allocation is ubiquitous • malloc, free, new, delete, etc. • Serial memory allocators are inadequate • Sufficient for correctness • Do not scale for multithreaded applications • Existing concurrent allocators do not meet all the requirements
Parallel Allocator Requirements • Speed • As fast as a serial allocator on a single-processor system • Scalability • Performance scales linearly with the number of processors • False sharing avoidance • Does not introduce false sharing of cache lines • Low fragmentation • Keep the ratio (OS-allocated memory / application-allocated memory) low • Fragmentation affects performance through locality & swapping (a small tracking sketch follows)
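Not from the paper: a minimal sketch of how an allocator (e.g., one built for assignment #2) could track this ratio. The counter names are hypothetical placeholders.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical counters, updated on the allocator's mmap/sbrk and
 * malloc/free paths respectively. */
static size_t bytes_from_os = 0; /* memory obtained from the OS      */
static size_t bytes_in_use  = 0; /* memory currently live in the app */

/* Fragmentation ratio: how much OS memory backs each live byte. */
static double fragmentation(void) {
    return bytes_in_use ? (double)bytes_from_os / (double)bytes_in_use : 1.0;
}

int main(void) {
    bytes_from_os = 1 << 20;   /* allocator holds 1 MiB from the OS */
    bytes_in_use  = 256 << 10; /* application has 256 KiB live      */
    printf("fragmentation = %.2f\n", fragmentation()); /* prints 4.00 */
    return 0;
}
```

A high ratio means poor locality (live data spread over many pages) and, in the extreme, swapping.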
False Sharing • Multiple processors use bytes on the same cache line without actually sharing any data • CPU 1 uses the int @ 0x1000, CPU 2 uses the int @ 0x1004 • Program-induced: data passed between threads • How can an allocator avoid inducing this? • Allocator-induced: • Active – malloc returns heap objects on the same cache line to different threads • Passive – free allows a future malloc to produce false sharing (see the sketch below)
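To make the allocator-induced case concrete, here is a minimal sketch (not from the paper; it assumes 64-byte cache lines): two consecutive small mallocs typically return adjacent addresses on one cache line, so two threads writing those objects ping-pong the line between their caches even though no byte is logically shared.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 50000000L

/* Each thread hammers its own counter; no bytes are logically shared. */
static void *worker(void *arg) {
    volatile long *counter = arg;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;
    return NULL;
}

int main(void) {
    /* Consecutive small allocations usually return adjacent addresses,
     * so a and b often share one 64-byte cache line (assumed size). */
    long *a = malloc(sizeof *a);
    long *b = malloc(sizeof *b);
    *a = *b = 0;
    printf("same cache line? %s\n",
           ((uintptr_t)a / 64 == (uintptr_t)b / 64) ? "yes" : "no");

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, a);
    pthread_create(&t2, NULL, worker, b); /* falsely shares a's line */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    free(a);
    free(b);
    return 0;
}
```

Compile with -pthread; timing this against a version that pads each counter to its own cache line shows the slowdown.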
Blowup • A special case of fragmentation: blowup = (max memory allocated) / (max memory allocated by an ideal uniprocessor allocator) • Unbounded, or grows linearly with # of CPUs • Caused by a parallel allocator not using freed memory to satisfy future allocation requests • E.g., thread-private heaps with no ownership & a producer-consumer pattern (see the sketch below) • All data allocated on the producer's heap and released on the consumer's heap
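A toy sketch of that pattern (hypothetical code, not the paper's benchmark): with pure private heaps and no ownership, every block below is allocated from the producer's heap but freed to the consumer's heap, so the producer's heap never sees the freed memory again and keeps growing.

```c
#include <pthread.h>
#include <stdlib.h>

#define N 100000

/* volatile so the consumer's spin loop re-reads each slot; this
 * illustrates the allocation pattern only, not a correct queue. */
static void * volatile slots[N];

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        slots[i] = malloc(64);   /* allocated on the producer's heap */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        while (slots[i] == NULL) /* wait for the producer */
            ;
        free(slots[i]);          /* with no ownership, the block lands on
                                  * the consumer's heap; the producer can
                                  * never reuse it */
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```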
Related Work (i.e., What to Avoid in Assignment 2) • Serial single heap • 1 locked free list • Low fragmentation, quite fast • Lock contention => poor scaling • Active false sharing • Concurrent single heap • 1 locked free list per block size • Reduces to a serial single heap in the common case (most allocated objects are of only a few sizes) • Active false sharing • Too many expensive locks or atomic operations
Related Work (i.e., What to Avoid in Assignment 2) • Multiple heaps: • Heap assignment: 1-to-1, round-robin, map function • 1. Pure private heaps • Blocks are freed to the calling thread's heap • Unbounded memory use for producer-consumer (as sketched above) • Passive false sharing
Related Work (i.e., What to Avoid in Assignment 2) • 2. Private heaps with ownership • Blocks are freed to the allocating thread's heap • Has O(P) blowup unless there is some mechanism for redistributing freed memory • Some actively induce false sharing • 3. Private heaps with thresholds • Vee and Hsu, DYNIX • Efficient and scalable • A hierarchy of per-processor heaps and shared heaps • O(1) blowup • Passively induce false sharing
Hoard’s scalable memory allocator • Fast (performance) • Highly scalable (with P) • Avoid false sharing • Memory efficient (low fragmentation, avoid blowup)
Hoard's Design • Per-processor heaps and a single global heap • Threads are mapped to a processor's heap • N.B., a thread scheduled on another processor still uses its original processor's heap • Heaps are divided into page-aligned superblocks • A superblock is divided into blocks • All of a superblock's blocks have the same size • Block sizes are b, b², b³, b⁴, … • Bounds internal fragmentation • (These structures are sketched below)
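A minimal sketch of these structures (hypothetical names and simplified fields; power-of-two size classes are used here for brevity, whereas the slide's classes are powers of a base b):

```c
#include <pthread.h>
#include <stddef.h>

#define SUPERBLOCK_SIZE 8192  /* page-aligned, e.g., two 4 KiB pages */
#define BASE            8     /* smallest block size b               */
#define NUM_CLASSES     9     /* 8 B, 16 B, ..., 2 KiB               */

typedef struct superblock {
    struct superblock *next;       /* link within a fullness group       */
    struct heap       *owner;      /* the heap that owns this superblock */
    size_t             block_size; /* all blocks here are this size      */
    unsigned           in_use;     /* number of allocated blocks         */
    void              *free_list;  /* free blocks within the superblock  */
    /* SUPERBLOCK_SIZE bytes of block payload follow */
} superblock_t;

typedef struct heap {
    pthread_mutex_t lock;              /* one lock per heap         */
    superblock_t   *bins[NUM_CLASSES]; /* superblocks by size class */
    size_t          in_use;            /* bytes live in this heap   */
    size_t          allocated;         /* bytes held in superblocks */
} heap_t;

/* Round a request up to its size class (powers of two in this sketch). */
static size_t size_class(size_t n) {
    size_t sz = BASE;
    while (sz < n)
        sz *= 2;
    return sz;
}
```

Because superblocks are page-aligned, two superblocks never straddle the same cache line, which matters for the false-sharing discussion later.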
Bounding Blowup • A heap owns some superblocks • Assigned from the global heap on allocation • A heap only allocates from superblocks it owns • When no memory is available in any superblock on a thread's heap • Obtain a superblock from the global heap, if available • If not (global heap empty too), request a new superblock from the OS and add it to the thread's heap • Does not return empty superblocks to the OS • Reuses them instead • (The allocation path is sketched below)
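The allocation path as a compile-only control-flow sketch, reusing the types from the previous block; the helper functions are hypothetical and declared only to show the three steps (own heap, then global heap, then OS):

```c
/* Hypothetical helpers, declared only to show the control flow. */
superblock_t *find_superblock_with_space(heap_t *h, size_t sz);
superblock_t *global_heap_get(size_t sz);
superblock_t *superblock_new_from_os(size_t sz);
void          heap_add_superblock(heap_t *h, superblock_t *sb);
void         *superblock_alloc_block(superblock_t *sb);

void *hoard_malloc(heap_t *heap, size_t n) {
    size_t sz = size_class(n);
    pthread_mutex_lock(&heap->lock);
    superblock_t *sb = find_superblock_with_space(heap, sz);
    if (sb == NULL) {
        sb = global_heap_get(sz);            /* 1st fallback: global heap */
        if (sb == NULL)
            sb = superblock_new_from_os(sz); /* 2nd fallback: the OS      */
        heap_add_superblock(heap, sb);       /* this heap now owns it     */
        heap->allocated += SUPERBLOCK_SIZE;
    }
    void *block = superblock_alloc_block(sb);
    heap->in_use += sz;
    pthread_mutex_unlock(&heap->lock);
    return block;
}
```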
Bounding Blowup • Superblocks are returned to the global heap when f, the empty fraction, of their blocks are not in use • If the heap is not more than f empty, or has K (a fixed constant) or fewer superblocks, no superblocks are moved to the global heap • Intuitively, these conditions maintain invariants about the proportion of wasted space on each heap
Bounding Blowup • Superblocks are returned to the global heap when f, the empty fraction, of blocks are not in use • Gives O(1) blowup • Also limits false sharing, since released superblocks are guaranteed to be at least f-empty • "Fullness" groups: • bins of superblocks within the same "fullness" range • LIFO order => reuses a superblock that is already in memory, and likely reuses blocks already in cache => maintains good locality • (The free path is sketched below)
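The matching free path, with the same hypothetical types and helpers, enforcing the slide's condition: a superblock (itself at least f empty) moves to the global heap only when the heap is more than f empty and holds more than K superblocks. The values of f and K below are placeholders, not the paper's settings.

```c
#define F_EMPTY 0.25  /* f, the empty fraction (placeholder value)       */
#define K_MIN   4     /* K, superblocks a heap always keeps (placeholder) */

/* More hypothetical helpers for the sketch. */
void          superblock_free_block(superblock_t *sb, void *block);
unsigned      heap_superblock_count(heap_t *h);
superblock_t *pop_emptiest_superblock(heap_t *h); /* from LIFO fullness bins */
void          global_heap_put(superblock_t *sb);

void hoard_free(heap_t *heap, superblock_t *sb, void *block) {
    pthread_mutex_lock(&heap->lock);
    superblock_free_block(sb, block); /* back to its own superblock,
                                       * avoiding passive false sharing */
    heap->in_use -= sb->block_size;

    /* Heap more than f empty AND more than K superblocks?
     * Then release a superblock that is at least f empty. */
    if (heap->in_use < (1.0 - F_EMPTY) * heap->allocated &&
        heap_superblock_count(heap) > K_MIN) {
        superblock_t *victim = pop_emptiest_superblock(heap);
        heap->allocated -= SUPERBLOCK_SIZE;
        global_heap_put(victim);
    }
    pthread_mutex_unlock(&heap->lock);
}
```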
Example • f = 0.25 • K = 0 • With K = 0, as soon as a heap becomes more than 25% empty, one of its superblocks (itself at least 25% empty) is moved to the global heap
Avoiding False Sharing • Heap allocations are made from superblocks • Different superblocks lie on different cache lines (they are page-aligned) • Each superblock is owned by one heap • Avoids actively-induced false sharing • Freed memory returns to its allocating superblock • Avoids passively-induced false sharing • How can the allocator still induce false sharing? • Multiple running threads using the same heap • Superblocks returned to the global heap aren't completely empty
Avoiding Contention • Lock contention low for scalable applications • Allocation by one thread and freeing by another is uncommon • Producer-consumer is realistic worst case • Memory operations serialized for two threads • Global heap rarely accessed • Steady-state memory use is within a constant factor of maximum memory use
Results • In general, Hoard performs & scales very well • Performance & scalability are poor when a program allocates few objects per distinct block size • Most requests then result in superblock creation • This is an uncommon memory allocation pattern
Read the paper for full details! • Assignment 2… coming soon!