Memory Management for High-Performance Applications
Emery Berger, University of Massachusetts, Amherst
High-Performance Applications
• Web servers, search engines, scientific codes
• C or C++ (still…)
• Run on one or a cluster of server boxes
• Need support at every level
[Stack diagram: software, compiler, runtime system, operating system, hardware]
New Applications, Old Memory Managers
• Applications and hardware have changed
  • Multiprocessors now commonplace
  • Object-oriented, multithreaded
  • Increased pressure on the memory manager (malloc, free)
• But memory managers have not kept up
  • Inadequate support for modern applications
Current Memory Managers Limit Scalability
• As we add processors, the program slows down
• Caused by heap contention
[Graph: Larson server benchmark on a 14-processor Sun]
The Problem
• Current memory managers are inadequate for high-performance applications on modern architectures
• They limit scalability, application performance, and robustness
This Talk
• Building memory managers
  • Heap Layers framework [PLDI 2001]
• Problems with current memory managers
  • Contention, false sharing, space
• Solution: provably scalable memory manager
  • Hoard [ASPLOS-IX]
• Extended memory manager for servers
  • Reap [OOPSLA 2002]
Implementing Memory Managers
• Memory managers must be
  • Space efficient
  • Very fast
• The result: heavily optimized code
  • Hand-unrolled loops
  • Macros
  • Monolithic functions
• Hard to write, reuse, or extend
Building Modular Memory Managers
• Classes
  • Overhead
  • Rigid hierarchy
• Mixins
  • No overhead
  • Flexible hierarchy
A Heap Layer
• A mixin with malloc and free methods (a fuller sketch follows):

    template <class SuperHeap>
    class GreenHeapLayer : public SuperHeap { … };
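To make the mixin idiom concrete, here is a minimal sketch of how layers compose. This is illustrative code, not the actual Heap Layers source; MallocHeap and CountingHeapLayer are names invented for this example.

    #include <cstddef>
    #include <cstdlib>

    // Bottom layer: obtains memory from the system allocator.
    class MallocHeap {
    public:
        void * malloc(std::size_t sz) { return std::malloc(sz); }
        void free(void * ptr)         { std::free(ptr); }
    };

    // A mixin layer: adds allocation counting to any superheap.
    template <class SuperHeap>
    class CountingHeapLayer : public SuperHeap {
    public:
        CountingHeapLayer() : allocations_(0) {}
        void * malloc(std::size_t sz) {
            ++allocations_;                 // layer-specific behavior
            return SuperHeap::malloc(sz);   // then defer to the superheap
        }
        std::size_t allocations() const { return allocations_; }
    private:
        std::size_t allocations_;
    };

    // Layers compose at compile time, so the calls inline away:
    // no virtual-call overhead, yet the hierarchy stays flexible.
    typedef CountingHeapLayer<MallocHeap> CountingMallocHeap;

Because composition happens through templates rather than virtual dispatch, this style gets the flexibility of a class hierarchy without its runtime overhead, which is the point of the classes-vs-mixins comparison above.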
Example: Thread-Safe Heap Layer
• LockedHeap: protects the superheap with a lock
• Layered over a malloc heap, this yields LockedMallocHeap (sketch below)
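A locking layer in this style might look like the following sketch, reusing the MallocHeap from the previous example (assumes C++11's std::mutex; the real LockedHeap in Heap Layers may differ in detail):

    #include <cstddef>
    #include <mutex>

    template <class SuperHeap>
    class LockedHeap : public SuperHeap {
    public:
        void * malloc(std::size_t sz) {
            std::lock_guard<std::mutex> guard(lock_);  // serialize access
            return SuperHeap::malloc(sz);
        }
        void free(void * ptr) {
            std::lock_guard<std::mutex> guard(lock_);
            SuperHeap::free(ptr);
        }
    private:
        std::mutex lock_;
    };

    // Thread safety by composition:
    typedef LockedHeap<MallocHeap> LockedMallocHeap;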
Empirical Results
• Heap Layers vs. originals:
  • KingsleyHeap vs. BSD allocator
  • LeaHeap vs. DLmalloc 2.7
• Competitive runtime and memory efficiency
Overview
• Building memory managers
  • Heap Layers framework
• Problems with memory managers
  • Contention, space, false sharing
• Solution: provably scalable allocator
  • Hoard
• Extended memory manager for servers
  • Reap
Problems with General-Purpose Memory Managers
• Previous work for multiprocessors
  • Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
  • Multiple heaps [Larson 98, Gloger 99]: reduce contention but cause other problems, as we show:
    • P-fold or even unbounded increase in space
    • Allocator-induced false sharing
Multiple Heap Allocator: Pure Private Heaps
• One heap per processor:
  • malloc gets memory from its local heap
  • free puts memory on its local heap
• Used by STL, Cilk, and ad hoc allocators
[Animation: processors 0 and 1 each allocate and free objects (x1 = malloc(1), free(x1), …) entirely on their own heaps; key distinguishes in-use from free memory]
Problem: Unbounded Memory Consumption
• Producer-consumer:
  • Processor 0 allocates
  • Processor 1 frees
• Unbounded memory consumption: the freed memory lands on processor 1's heap and never returns to processor 0 (sketch below)
• Crash!

    processor 0        processor 1
    x1 = malloc(1)
                       free(x1)
    x2 = malloc(1)
                       free(x2)
    x3 = malloc(1)
                       free(x3)
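The following deliberately naive sketch (invented for illustration, not any real allocator's code) shows why pure private heaps are unbounded here: free always pushes onto the calling thread's own list, no matter who allocated the object.

    #include <cstddef>
    #include <cstdlib>

    // A "pure private heap": one instance per thread. Single size
    // class, no locking -- illustrative only; sz >= sizeof(void *).
    struct PrivateHeap {
        void * freeList = nullptr;       // this thread's freed blocks

        void * malloc(std::size_t sz) {
            if (freeList) {              // reuse local free memory
                void * p = freeList;
                freeList = *(void **)p;  // pop the embedded link
                return p;
            }
            return std::malloc(sz);      // otherwise grow
        }
        void free(void * p) {            // always OUR list, regardless
            *(void **)p = freeList;      // of who allocated p
            freeList = p;
        }
    };
    // Pathology: if thread 0 only allocates and thread 1 only frees,
    // thread 1's list grows without bound while thread 0 keeps asking
    // the OS for fresh memory -- exactly the trace above.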
Multiple Heap Allocator: Private Heaps with Ownership
• free returns memory to the original (owning) heap (sketch below)
• Bounded memory consumption
• No crash!
• Used by Ptmalloc (Linux) and LKmalloc

    processor 0        processor 1
    x1 = malloc(1)
                       free(x1)
    x2 = malloc(1)
                       free(x2)
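A minimal sketch of the ownership idea (the block layout is hypothetical; Ptmalloc and LKmalloc differ in detail): each block carries a header naming its owning heap, so free returns it to the right place no matter which thread calls it.

    #include <cstddef>
    #include <cstdlib>

    // No locking or size segregation shown -- illustrative only.
    class OwnerHeap {
        struct Header {
            OwnerHeap * owner;   // heap this block belongs to
            Header *    next;    // free-list link (valid while free)
        };
        Header * freeList = nullptr;

    public:
        void * malloc(std::size_t sz) {
            Header * h = freeList;
            if (h) {
                freeList = h->next;      // reuse a local free block
            } else {
                h = (Header *)std::malloc(sizeof(Header) + sz);
                h->owner = this;         // record ownership once
            }
            return h + 1;                // payload follows the header
        }

        static void free(void * p) {
            Header * h = (Header *)p - 1;
            // Push onto the OWNER's list, not the caller's: this bounds
            // producer-consumer growth, though P-fold blowup remains.
            h->next = h->owner->freeList;
            h->owner->freeList = h;
        }
    };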
Problem: P-fold Memory Blowup
• Occurs in practice
• Round-robin producer-consumer:
  • processor i mod P allocates
  • processor (i+1) mod P frees
• Footprint = 1 (2GB), but space = 3 (6GB)
• Exceeds the 32-bit address space: Crash!

    processor 0        processor 1        processor 2
    x1 = malloc(1)
                       free(x1)
                       x2 = malloc(1)
                                          free(x2)
                                          x3 = malloc(1)
    free(x3)
Problem: Allocator-Induced False Sharing
• False sharing: non-shared objects on the same cache line
• The bane of parallel applications; extensively studied
• All these allocators cause false sharing! When two processors each malloc a small object, the objects can land on the same cache line, and independent writes then thrash that line between caches (sketch below):

    processor 0        processor 1
    x1 = malloc(1)     x2 = malloc(1)
    thrash…            thrash…
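This tiny program sketches the effect; the struct stands in for two adjacent heap objects, and the 64-byte line size and timings are platform-dependent:

    #include <thread>

    // Two small "objects" placed adjacently, as a naive allocator
    // would place them: they share one cache line, so independent
    // writes by two threads ping-pong the line between caches.
    struct TwoObjects {
        long x1;   // written only by thread 0
        long x2;   // written only by thread 1; same line as x1
    };

    int main() {
        TwoObjects obj{};
        std::thread t0([&]{ for (long i = 0; i < 100000000; ++i) obj.x1++; });
        std::thread t1([&]{ for (long i = 0; i < 100000000; ++i) obj.x2++; });
        t0.join();
        t1.join();
        // Padding each object to its own cache line (e.g., alignas(64))
        // removes the slowdown.
        return (int)(obj.x1 + obj.x2);
    }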
So What Do We Do Now?
• Where do we put free memory?
  • On a central heap: heap contention
  • On our own heap (pure private heaps): unbounded memory consumption
  • On the original heap (private heaps with ownership): P-fold blowup
• How do we avoid false sharing?
Overview
• Building memory managers
  • Heap Layers framework
• Problems with memory managers
  • Contention, space, false sharing
• Solution: provably scalable allocator
  • Hoard
• Extended memory manager for servers
  • Reap
Hoard: Key Insights
• Bound local memory consumption
  • Explicitly track utilization
  • Move free memory to a global heap
  • Provably bounds memory consumption
• Manage memory in large chunks
  • Avoids false sharing
  • Reduces heap contention
Overview of Hoard
• Manage memory in heap blocks
  • Page-sized
  • Avoids false sharing
• Allocate from the local heap block
  • Avoids heap contention
• On low utilization, move the heap block to the global heap
  • Avoids space blowup (see the sketch after this slide)
[Diagram: global heap above per-processor heaps 0 … P-1]
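A greatly simplified sketch of the bookkeeping behind "move low-utilization blocks to the global heap". The names and the 1/4 threshold are illustrative; size classes, per-block free lists, and the fast paths are omitted (see the ASPLOS-IX paper for the real design):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct HeapBlock {            // page-sized chunk owned by one heap
        std::size_t size  = 4096; // bytes this block can hold
        std::size_t inUse = 0;    // bytes currently allocated from it
    };

    struct GlobalHeap {
        std::mutex lock;
        std::vector<HeapBlock *> blocks;
    };

    GlobalHeap globalHeap;

    struct LocalHeap {            // one per processor
        std::mutex lock;
        std::vector<HeapBlock *> blocks;
        std::size_t inUse = 0;    // live bytes on this heap
        std::size_t held  = 0;    // total bytes in this heap's blocks

        // Invariant maintained after each free: if utilization drops
        // below a threshold, a mostly-empty block moves to the global
        // heap. This is what bounds per-processor space.
        void releaseIfUnderused() {
            const double EMPTY_FRACTION = 0.25;   // illustrative
            if (!blocks.empty() && inUse < EMPTY_FRACTION * held) {
                HeapBlock * b = blocks.back();    // real Hoard picks an
                blocks.pop_back();                // emptiest block
                held  -= b->size;
                inUse -= b->inUse;
                std::lock_guard<std::mutex> g(globalHeap.lock);
                globalHeap.blocks.push_back(b);
            }
        }
    };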
Summary of Analytical Results
• Space consumption: near-optimal worst case
  • Hoard: O(n log M/m + P), where P « n
  • Optimal: O(n log M/m) [Robson 70] (≈ bin-packing)
  • Private heaps with ownership: O(P n log M/m)
• Provably low synchronization
• Notation: n = memory required, M = largest object size, m = smallest object size, P = number of processors
• The bounds are restated side by side below
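Restating those bounds in LaTeX, side by side (the same results as above, not new ones):

    \underbrace{O\!\left(n \log \tfrac{M}{m}\right)}_{\text{optimal [Robson 70]}}
    \;\le\;
    \underbrace{O\!\left(n \log \tfrac{M}{m} + P\right)}_{\text{Hoard}}
    \;\ll\;
    \underbrace{O\!\left(P \cdot n \log \tfrac{M}{m}\right)}_{\text{private heaps with ownership}}
    \qquad (P \ll n)

Because P « n, Hoard's additive P term is negligible, while the multiplicative factor of P in the ownership scheme is exactly the P-fold blowup shown earlier.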
Empirical Results
• Measured runtime on a 14-processor Sun
• Allocators:
  • Solaris (system allocator)
  • Ptmalloc (GNU libc)
  • mtmalloc (Sun's “MT-hot” allocator)
• Micro-benchmarks:
  • Threadtest: no sharing
  • Larson: sharing (server-style)
  • Cache-scratch: mostly reads & writes (tests for false sharing)
• Real application experience is similar
Runtime Performance: threadtest
• Many threads, no sharing
• Hoard achieves linear speedup
• speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)
Runtime Performance: Larson
• Many threads, sharing (server-style)
• Hoard achieves linear speedup
Runtime Performance: false sharing
• Many threads, mostly reads & writes of heap data
• Hoard achieves linear speedup
Hoard in the “Real World”
• Open source code: www.hoard.org
  • 13,000 downloads
  • Solaris, Linux, Windows, IRIX, …
• Widely used in industry
  • AOL, British Telecom, Novell, Philips
  • Reports: 2x-10x, “impressive” improvement in performance
  • Search servers, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engines, telephony, JVMs
• A scalable general-purpose memory manager
Overview
• Building memory managers
  • Heap Layers framework
• Problems with memory managers
  • Contention, space, false sharing
• Solution: provably scalable allocator
  • Hoard
• Extended memory manager for servers
  • Reap
Custom Memory Allocation
• Programmers often replace malloc/free
  • Attempt to increase performance
  • Provide extra functionality (e.g., for servers)
  • Reduce space (rarely)
• Empirical study of custom allocators [OOPSLA 2002]
  • The Lea allocator is often as fast or faster
  • Custom allocation is ineffective, except for regions
Overview of Regions
• Separate areas, deletion only en masse
• API: regioncreate(r), regionmalloc(r, sz), regiondelete(r)
• Fast
  • Pointer-bumping allocation
  • Deletion of whole chunks
• Convenient
  • One call frees all memory
• Risky
  • Accidental deletion
  • Too much space
• A minimal sketch follows
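A minimal region sketch matching the API named above. Chunk size, alignment handling, and the exact signatures are illustrative; for instance, regioncreate is shown returning the new region rather than taking it as an argument, and sz is assumed to fit in one chunk:

    #include <cstddef>
    #include <cstdlib>

    struct RegionChunk {
        RegionChunk * next;
        char          data[4096];
    };

    struct Region {
        RegionChunk * chunks = nullptr;
        std::size_t   used   = 0;   // bytes used in the current chunk
    };

    Region * regioncreate() { return new Region(); }

    void * regionmalloc(Region * r, std::size_t sz) {
        if (!r->chunks || r->used + sz > sizeof(r->chunks->data)) {
            // Grab a fresh chunk and chain it onto the region.
            RegionChunk * c = (RegionChunk *)std::malloc(sizeof(RegionChunk));
            c->next = r->chunks;
            r->chunks = c;
            r->used = 0;
        }
        void * p = r->chunks->data + r->used;   // pointer bumping: fast,
        r->used += sz;                          // but no per-object free
        return p;
    }

    void regiondelete(Region * r) {             // one call frees everything
        for (RegionChunk * c = r->chunks; c != nullptr; ) {
            RegionChunk * next = c->next;
            std::free(c);
            c = next;
        }
        delete r;
    }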
Why Regions?
• Apparently faster and more space-efficient
• Servers need memory management support:
  • Avoid resource leaks
  • Tear down memory associated with terminated connections or transactions
• Current approach (e.g., Apache): regions
Drawbacks of Regions
• Can't reclaim memory within regions
  • A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs
  • Unbounded memory consumption
• Current situation for Apache:
  • Vulnerable to denial-of-service
  • Limits the runtime of connections
  • Limits module programming
Reap Hybrid Allocator
• Reap = region + heap
  • Adds individual object deletion and a heap
• API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
• Can reduce memory consumption
• Fast
  • Adapts to its use (region or heap style)
  • Cheap deletion
• A usage sketch follows
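A hypothetical usage sketch of the API named above. The declarations are guesses at the signatures based on the slide, not the actual Reap interface from the OOPSLA 2002 paper:

    #include <cstddef>

    // Assumed declarations (illustrative only):
    struct Reap;
    Reap * reapcreate();
    void * reapmalloc(Reap * r, std::size_t sz);
    void   reapfree(Reap * r, void * p);
    void   reapdelete(Reap * r);

    void example() {
        Reap * r = reapcreate();
        void * a = reapmalloc(r, 64);   // region-style: pointer bumping
        void * b = reapmalloc(r, 64);
        reapfree(r, a);                 // heap-style: a becomes reusable
        void * c = reapmalloc(r, 64);   // may recycle a's memory
        reapdelete(r);                  // still frees everything at once
        (void)b; (void)c;
    }

The key behavioral point from the talk: a reap acts like a region until the first reapfree, then adapts to heap-style reuse, so it keeps region speed while avoiding unbounded growth.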
Using Reap as Regions
• Reap performance nearly matches regions
Reap: Best of Both Worlds
• Combining new/delete with regions is usually impossible:
  • Incompatible APIs
  • Hard to rewrite code
• Using Reap: incorporated new/delete code into Apache
  • “mod_bc” (arbitrary-precision calculator)
  • Changed 20 lines (out of 8000)
  • Benchmark: compute the 1000th prime
    • With Reap: 240K
    • Without Reap: 7.4MB
Open Questions
• A Grand Unified Memory Manager?
  • Hoard + Reap
  • Integration with garbage collection
• Effective custom allocators?
  • Exploit sizes, lifetimes, locality, and sharing
• Challenges of newer architectures
  • NUMA, SMT/CMP, 64-bit, predication
Current Work: Robust Performance
• Currently: no VM-GC communication
  • Bad interactions under memory pressure
• Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management
[Diagram: the virtual memory manager and the garbage collector/allocator exchange memory-pressure signals, empty pages, and LRU-queue information to reduce paging impact]
Current Work: Predictable VMM
• Recent work on scheduling for QoS
  • E.g., proportional-share
  • Under memory pressure, the VMM is the scheduler
• Paged-out processes may never recover
• Intermittent processes may wait a long time
• Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy)
  • Based on page value rather than page order
Conclusion
Memory management for high-performance applications:
• Heap Layers framework [PLDI 2001]
  • Reusable components, no runtime cost
• Hoard scalable memory manager [ASPLOS-IX]
  • High performance; provably scalable and space-efficient
• Reap hybrid memory manager [OOPSLA 2002]
  • Provides speed and robustness for server applications
• Current work: robust memory management for multiprogramming
The Obligatory URL Slide http://www.cs.umass.edu/~emery
Hoard: Under the Hood
[Layer diagram: select a heap based on size; malloc from the local heap, free to the owning heap block; get memory from, or return it to, the global heap]
Custom Memory Allocation
• Replace new/delete, bypassing the general-purpose allocator
  • Reduce runtime: often
  • Expand functionality: sometimes
  • Reduce space: rarely
• Very common practice
  • Apache, gcc, lcc, STL, database servers…
  • Language-level support in C++
  • “Use custom allocators”
Drawbacks of Custom Allocators
• Avoiding the memory manager means:
  • More code to maintain and debug
  • Can't use memory debuggers
  • Not modular or robust:
    • Mixing memory from custom and general-purpose allocators → crash!
• Increased burden on programmers
Overview
• Introduction
• Perceived benefits and drawbacks
• Three main kinds of custom allocators
• Comparison with general-purpose allocators
• Advantages and drawbacks of regions
• Reaps: a generalization of regions and heaps
(I) Per-Class Allocators
• Recycle freed objects from a free list (sketch below):

    a = new Class1;     delete a;     a = new Class1;
    b = new Class1;     delete b;     b = new Class1;
    c = new Class1;     delete c;     c = new Class1;

• Fast
  • Linked-list operations
• Simple
  • Identical semantics
  • C++ language support
• Possibly space-inefficient
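A minimal per-class free-list sketch using C++'s class-level operator new/delete; Class1 and its payload are illustrative stand-ins:

    #include <cstddef>
    #include <cstdlib>

    class Class1 {
    public:
        void * operator new(std::size_t sz) {
            if (freeList_) {                 // recycle a freed instance
                void * p = freeList_;
                freeList_ = *(void **)p;     // pop the embedded link
                return p;
            }
            return std::malloc(sz);          // (no out-of-memory handling)
        }
        void operator delete(void * p) {
            if (!p) return;
            *(void **)p = freeList_;         // push onto the free list
            freeList_ = p;
        }
    private:
        static void * freeList_;
        char payload_[32];                   // ensures room for the link
    };

    void * Class1::freeList_ = nullptr;

As the slide notes, this is fast (a push or pop per operation) and semantically identical to plain new/delete for this class, but memory parked on one class's free list can never serve another class, which is the space inefficiency.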
(II) Custom Patterns
• Tailor-made to fit allocation patterns
• Example: 197.parser (natural language parser), sketched below

    char[MEMORY_LIMIT]
    a = xalloc(8);
    b = xalloc(16);
    c = xalloc(8);
    xfree(b);
    xfree(c);
    d = xalloc(8);

[Animation: the end_of_array pointer advances through char[MEMORY_LIMIT] as a, b, c, and d are allocated]
• Fast
  • Pointer-bumping allocation
• Brittle
  • Fixed memory size
  • Requires stack-like lifetimes
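A sketch of the 197.parser pattern: one fixed char array, pointer-bumping allocation, and frees that reclaim memory only when the freed object is the last one allocated (hence the stack-like-lifetimes requirement). The xalloc/xfree names and MEMORY_LIMIT mirror the slide; the size header and the capacity are invented details:

    #include <cstddef>

    const std::size_t MEMORY_LIMIT = 1 << 20;   // illustrative capacity
    static char        memory[MEMORY_LIMIT];
    static std::size_t end_of_array = 0;        // the bump pointer

    void * xalloc(std::size_t sz) {
        char * p = &memory[end_of_array];
        *(std::size_t *)p = sz;                      // stash the size
        end_of_array += sizeof(std::size_t) + sz;    // fast: just bump
        return p + sizeof(std::size_t);              // (no overflow check)
    }

    void xfree(void * q) {
        char * p = (char *)q - sizeof(std::size_t);
        std::size_t sz = *(std::size_t *)p;
        // Reclaim only if this is the most recent live allocation;
        // otherwise the space stays wasted until everything allocated
        // after it has been freed -- the brittleness the slide names.
        if (p + sizeof(std::size_t) + sz == &memory[end_of_array])
            end_of_array = p - memory;
    }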
(III) Regions
• Separate areas, deletion only en masse (see the region sketch earlier)
• API: regioncreate(r), regionmalloc(r, sz), regiondelete(r)
• Fast
  • Pointer-bumping allocation
  • Deletion of whole chunks
• Convenient
  • One call frees all memory
• Risky
  • Accidental deletion
  • Too much space
Overview
• Introduction
• Perceived benefits and drawbacks
• Three main kinds of custom allocators
• Comparison with general-purpose allocators
• Advantages and drawbacks of regions
• Reaps: a generalization of regions and heaps