Hoard: A Scalable Memory Allocator for Multithreaded Applications Berger*, McKinley+, Blumofe*, Wilson* (*UT Austin, +University of Massachusetts) ASPLOS 2000 Presented by Bogdan Simion
Dynamic Memory Allocation • Highly parallel applications are common • e.g., databases, web servers, assignment #2 • Dynamic memory allocation is ubiquitous • malloc, free, new, delete, etc. • Serial memory allocators are inadequate • Sufficient for correctness • Do not scale for multithreaded applications • Existing concurrent allocators do not meet all the requirements
Parallel Allocator Requirements • Speed • As fast as a serial allocator on a single-processor system • Scalability • Performance scales linearly with the number of processors • False sharing avoidance • Does not introduce false sharing of cache lines • Low fragmentation • Keep the ratio (OS-allocated memory / application-allocated memory) low • Fragmentation affects performance through locality & swapping (a small tracking sketch follows)
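Not from the paper: a minimal sketch of how an allocator (e.g., one built for assignment #2) could track this ratio. The counter names are hypothetical placeholders.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical counters, updated on the allocator's mmap/sbrk and
 * malloc/free paths respectively. */
static size_t bytes_from_os = 0; /* memory obtained from the OS      */
static size_t bytes_in_use  = 0; /* memory currently live in the app */

/* Fragmentation ratio: how much OS memory backs each live byte. */
static double fragmentation(void) {
    return bytes_in_use ? (double)bytes_from_os / (double)bytes_in_use : 1.0;
}

int main(void) {
    bytes_from_os = 1 << 20;   /* allocator holds 1 MiB from the OS */
    bytes_in_use  = 256 << 10; /* application has 256 KiB live      */
    printf("fragmentation = %.2f\n", fragmentation()); /* prints 4.00 */
    return 0;
}
```

A high ratio means poor locality (live data spread over many pages) and, in the extreme, swapping.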
False Sharing • Multiple processors use bytes on the same cache line without actually sharing any data • CPU 1 uses the int @ 0x1000, CPU 2 uses the int @ 0x1004 • Program-induced: data passed between threads • How can an allocator avoid inducing this? • Allocator-induced: • Active – malloc returns heap objects on the same cache line to different threads • Passive – free allows a future malloc to produce false sharing (see the sketch below)
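To make the allocator-induced case concrete, here is a minimal sketch (not from the paper; it assumes 64-byte cache lines): two consecutive small mallocs typically return adjacent addresses on one cache line, so two threads writing those objects ping-pong the line between their caches even though no byte is logically shared.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 50000000L

/* Each thread hammers its own counter; no bytes are logically shared. */
static void *worker(void *arg) {
    volatile long *counter = arg;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;
    return NULL;
}

int main(void) {
    /* Consecutive small allocations usually return adjacent addresses,
     * so a and b often share one 64-byte cache line (assumed size). */
    long *a = malloc(sizeof *a);
    long *b = malloc(sizeof *b);
    *a = *b = 0;
    printf("same cache line? %s\n",
           ((uintptr_t)a / 64 == (uintptr_t)b / 64) ? "yes" : "no");

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, a);
    pthread_create(&t2, NULL, worker, b); /* falsely shares a's line */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    free(a);
    free(b);
    return 0;
}
```

Compile with -pthread; timing this against a version that pads each counter to its own cache line shows the slowdown.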
Blowup • A special case of fragmentation: blowup = (max memory allocated) / (max memory allocated by an ideal uniprocessor allocator) • Unbounded, or grows linearly with # of CPUs • Caused by a parallel allocator not using freed memory to satisfy future allocation requests • E.g., thread-private heaps with no ownership & a producer-consumer pattern (see the sketch below) • All data allocated on the producer's heap and released on the consumer's heap
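A toy sketch of that pattern (hypothetical code, not the paper's benchmark): with pure private heaps and no ownership, every block below is allocated from the producer's heap but freed to the consumer's heap, so the producer's heap never sees the freed memory again and keeps growing.

```c
#include <pthread.h>
#include <stdlib.h>

#define N 100000

/* volatile so the consumer's spin loop re-reads each slot; this
 * illustrates the allocation pattern only, not a correct queue. */
static void * volatile slots[N];

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        slots[i] = malloc(64);   /* allocated on the producer's heap */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        while (slots[i] == NULL) /* wait for the producer */
            ;
        free(slots[i]);          /* with no ownership, the block lands on
                                  * the consumer's heap; the producer can
                                  * never reuse it */
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```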
Related Work (i.e., What to Avoid in Assignment 2) • Serial single heap • 1 locked free list • Low fragmentation, quite fast • Lock contention => poor scaling • Active false sharing • Concurrent single heap • 1 locked free list per block size • Reduces to a serial single heap in the common case (most allocated objects are of only a few sizes) • Active false sharing • Too many expensive locks or atomic operations
Related Work (i.e., What to Avoid in Assignment 2) • Multiple heaps: • Heap assignment: 1-to-1, round-robin, map function • 1. Pure private heaps • Blocks are freed to the calling thread's heap • Unbounded memory use for producer-consumer (as sketched above) • Passive false sharing
Related Work (i.e., What to Avoid in Assignment 2) • 2. Private heaps with ownership • Blocks are freed to the allocating thread's heap • Has O(P) blowup unless there is some mechanism for redistributing freed memory • Some actively induce false sharing • 3. Private heaps with thresholds • Vee and Hsu, DYNIX • Efficient and scalable • A hierarchy of per-processor heaps and shared heaps • O(1) blowup • Passively induce false sharing
Hoard’s scalable memory allocator • Fast (performance) • Highly scalable (with P) • Avoid false sharing • Memory efficient (low fragmentation, avoid blowup)
Hoard's Design • Per-processor heaps and a single global heap • Threads are mapped to a processor's heap • N.B., a thread scheduled on another processor still uses its original processor's heap • Heaps are divided into page-aligned superblocks • A superblock is divided into blocks • All of a superblock's blocks have the same size • Block sizes are b, b², b³, b⁴, … • Bounds internal fragmentation • (These structures are sketched below)
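A minimal sketch of these structures (hypothetical names and simplified fields; power-of-two size classes are used here for brevity, whereas the slide's classes are powers of a base b):

```c
#include <pthread.h>
#include <stddef.h>

#define SUPERBLOCK_SIZE 8192  /* page-aligned, e.g., two 4 KiB pages */
#define BASE            8     /* smallest block size b               */
#define NUM_CLASSES     9     /* 8 B, 16 B, ..., 2 KiB               */

typedef struct superblock {
    struct superblock *next;       /* link within a fullness group       */
    struct heap       *owner;      /* the heap that owns this superblock */
    size_t             block_size; /* all blocks here are this size      */
    unsigned           in_use;     /* number of allocated blocks         */
    void              *free_list;  /* free blocks within the superblock  */
    /* SUPERBLOCK_SIZE bytes of block payload follow */
} superblock_t;

typedef struct heap {
    pthread_mutex_t lock;              /* one lock per heap         */
    superblock_t   *bins[NUM_CLASSES]; /* superblocks by size class */
    size_t          in_use;            /* bytes live in this heap   */
    size_t          allocated;         /* bytes held in superblocks */
} heap_t;

/* Round a request up to its size class (powers of two in this sketch). */
static size_t size_class(size_t n) {
    size_t sz = BASE;
    while (sz < n)
        sz *= 2;
    return sz;
}
```

Because superblocks are page-aligned, two superblocks never straddle the same cache line, which matters for the false-sharing discussion later.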
Bounding Blowup • A heap owns some superblocks • Assigned from the global heap on allocation • A heap only allocates from superblocks it owns • When no memory is available in any superblock on a thread's heap • Obtain a superblock from the global heap, if available • If not (global heap empty too), request a new superblock from the OS and add it to the thread's heap • Does not return empty superblocks to the OS • Reuses them instead • (The allocation path is sketched below)
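The allocation path as a compile-only control-flow sketch, reusing the types from the previous block; the helper functions are hypothetical and declared only to show the three steps (own heap, then global heap, then OS):

```c
/* Hypothetical helpers, declared only to show the control flow. */
superblock_t *find_superblock_with_space(heap_t *h, size_t sz);
superblock_t *global_heap_get(size_t sz);
superblock_t *superblock_new_from_os(size_t sz);
void          heap_add_superblock(heap_t *h, superblock_t *sb);
void         *superblock_alloc_block(superblock_t *sb);

void *hoard_malloc(heap_t *heap, size_t n) {
    size_t sz = size_class(n);
    pthread_mutex_lock(&heap->lock);
    superblock_t *sb = find_superblock_with_space(heap, sz);
    if (sb == NULL) {
        sb = global_heap_get(sz);            /* 1st fallback: global heap */
        if (sb == NULL)
            sb = superblock_new_from_os(sz); /* 2nd fallback: the OS      */
        heap_add_superblock(heap, sb);       /* this heap now owns it     */
        heap->allocated += SUPERBLOCK_SIZE;
    }
    void *block = superblock_alloc_block(sb);
    heap->in_use += sz;
    pthread_mutex_unlock(&heap->lock);
    return block;
}
```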
Bounding Blowup • Superblocks are returned to the global heap when f, the empty fraction, of their blocks are not in use • If the heap is not more than f empty, or has K (a fixed constant) or fewer superblocks, no superblocks are moved to the global heap • Intuitively, these conditions maintain invariants about the proportion of wasted space on each heap
Bounding Blowup • Superblocks are returned to the global heap when f, the empty fraction, of blocks are not in use • Gives O(1) blowup • Also limits false sharing, since released superblocks are guaranteed to be at least f-empty • "Fullness" groups: • bins of superblocks within the same "fullness" range • LIFO order => reuses a superblock that is already in memory, and likely reuses blocks already in cache => maintains good locality • (The free path is sketched below)
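The matching free path, with the same hypothetical types and helpers, enforcing the slide's condition: a superblock (itself at least f empty) moves to the global heap only when the heap is more than f empty and holds more than K superblocks. The values of f and K below are placeholders, not the paper's settings.

```c
#define F_EMPTY 0.25  /* f, the empty fraction (placeholder value)       */
#define K_MIN   4     /* K, superblocks a heap always keeps (placeholder) */

/* More hypothetical helpers for the sketch. */
void          superblock_free_block(superblock_t *sb, void *block);
unsigned      heap_superblock_count(heap_t *h);
superblock_t *pop_emptiest_superblock(heap_t *h); /* from LIFO fullness bins */
void          global_heap_put(superblock_t *sb);

void hoard_free(heap_t *heap, superblock_t *sb, void *block) {
    pthread_mutex_lock(&heap->lock);
    superblock_free_block(sb, block); /* back to its own superblock,
                                       * avoiding passive false sharing */
    heap->in_use -= sb->block_size;

    /* Heap more than f empty AND more than K superblocks?
     * Then release a superblock that is at least f empty. */
    if (heap->in_use < (1.0 - F_EMPTY) * heap->allocated &&
        heap_superblock_count(heap) > K_MIN) {
        superblock_t *victim = pop_emptiest_superblock(heap);
        heap->allocated -= SUPERBLOCK_SIZE;
        global_heap_put(victim);
    }
    pthread_mutex_unlock(&heap->lock);
}
```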
Example • f = 0.25 • K = 0 • With K = 0, as soon as a heap becomes more than 25% empty, one of its superblocks (itself at least 25% empty) is moved to the global heap
Avoiding False Sharing • Heap allocations are made from superblocks • Different superblocks lie on different cache lines (they are page-aligned) • Each superblock is owned by one heap • Avoids actively-induced false sharing • Freed memory returns to its allocating superblock • Avoids passively-induced false sharing • How can the allocator still induce false sharing? • Multiple running threads using the same heap • Superblocks returned to the global heap aren't completely empty
Avoiding Contention • Lock contention low for scalable applications • Allocation by one thread and freeing by another is uncommon • Producer-consumer is realistic worst case • Memory operations serialized for two threads • Global heap rarely accessed • Steady-state memory use is within a constant factor of maximum memory use
Results • In general, Hoard performs & scales very well • Performance & scalability are poor when a program allocates few objects per distinct block size • Most requests then result in superblock creation • This is an uncommon memory allocation pattern
Read the paper for full details! • Assignment 2… coming soon!