780 likes | 935 Views
Memory Management for High-Performance Applications. Emery Berger. Advisor: Kathryn S. McKinley. Department of Computer Sciences. High-Performance Applications. Web servers, search engines, scientific codes C or C++ Run on one or cluster of server boxes. software. compiler.
E N D
Memory Management forHigh-Performance Applications Emery Berger Advisor: Kathryn S. McKinley Department of Computer Sciences Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
High-Performance Applications • Web servers, search engines, scientific codes • C or C++ • Run on one or cluster of server boxes software compiler • Needs support at every level runtime system operating system hardware Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
New Applications,Old Memory Managers • Applications and hardware have changed • Multiprocessors now commonplace • Object-oriented, multithreaded • Increased pressure on memory manager(malloc, free) • But memory managers have not changed • Inadequate support for modern applications Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Current Memory ManagersLimit Scalability • As we add processors, program slows down • Caused by heap contention Larson server benchmark on 14-processor Sun Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
The Problem • Current memory managersinadequate for high-performance applications on modern architectures • Limit scalability, application performance, and robustness Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Contributions • Building memory managers • Heap Layers framework • Problems with current memory managers • Contention, false sharing, space • Solution: provably scalable memory manager • Hoard • Extended memory manager for servers • Reap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Implementing Memory Managers • Memory managers must be • Space efficient • Very fast • Heavily-optimized code • Hand-unrolled loops • Macros • Monolithic functions • Hard to write, reuse, or extend Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Building Modular Memory Managers • Classes • Rigid hierarchy • Overhead • Mixins • Flexible hierarchy • No overhead Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
A Heap Layer • Mixin with malloc & free methods • template <class SuperHeap>class GreenHeapLayer : • public SuperHeap {…}; Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Example:Thread-Safe Heap Layer LockedHeap protect the superheap with a lock LockedMallocHeap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Empirical Results • Heap Layers vs. originals: • KingsleyHeapvs. BSD allocator • LeaHeapvs. DLmalloc 2.7 • Competitive runtime and memory efficiency Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Problems with General-Purpose Memory Managers • Previous work for multiprocessors • Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92] • Impractical • Multiple heaps [Larson 98, Gloger 99] • Reduce contention but cause other problems: • P-fold or even unbounded increase in space • Allocator-induced false sharing we show Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Multiple Heap Allocator:Pure Private Heaps • One heap per processor: • malloc gets memoryfrom its local heap • free puts memoryon its local heap • STL, Cilk, ad hoc Key: = in use, processor 0 = free, on heap 1 processor 0 processor 1 x1= malloc(1) x2= malloc(1) free(x1) free(x2) x4= malloc(1) x3= malloc(1) free(x3) free(x4) Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Problem:Unbounded Memory Consumption • Producer-consumer: • Processor 0 allocates • Processor 1 frees • Unbounded memory consumption • Crash! processor 0 processor 1 x1= malloc(1) free(x1) x2= malloc(1) free(x2) x3= malloc(1) free(x3) Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Multiple Heap Allocator:Private Heaps with Ownership • free returns memory to original heap • Bounded memory consumption • No crash! • “Ptmalloc” (Linux),LKmalloc processor 0 processor 1 x1= malloc(1) free(x1) x2= malloc(1) free(x2) Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Problem:P-fold Memory Blowup • Occurs in practice • Round-robin producer-consumer • processor i mod P allocates • processor (i+1) mod P frees • Footprint = 1 (2GB),but space = 3 (6GB) • Exceeds 32-bit address space: Crash! processor 0 processor 1 processor 2 x1= malloc(1) free(x1) x2= malloc(1) free(x2) x3=malloc(1) free(x3) Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Problem:Allocator-Induced False Sharing • False sharing • Non-shared objectson same cache line • Bane of parallel applications • Extensively studied • All these allocatorscause false sharing! cache line processor 0 processor 1 x1= malloc(1) x2= malloc(1) thrash… thrash… Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
So What Do We Do Now? • Where do we put free memory? • on central heap: • on our own heap:(pure private heaps) • on the original heap:(private heaps with ownership) • How do we avoid false sharing? • Heap contention • Unbounded memory consumption • P-fold blowup Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Hoard: Key Insights • Bound local memory consumption • Explicitly track utilization • Move free memory to a global heap • Provably bounds memory consumption • Manage memory in large chunks • Avoids false sharing • Reduces heap contention Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Overview of Hoard global heap • Manage memory in heap blocks • Page-sized • Avoids false sharing • Allocate from local heap block • Avoids heap contention • Low utilization • Move heap block to global heap • Avoids space blowup processor 0 processor P-1 … Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Hoard: Under the Hood get or return memory to global heap malloc from local heap, free to heap block select heap based on size Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Summary of Analytical Results • Space consumption: near optimal worst-case • Hoard: O(n log M/m + P) {P ¿ n} • Optimal: O(n log M/m)[Robson 70]: ≈ bin-packing • Private heaps with ownership: O(P n log M/m) • Provably low synchronization • n = memory required • M = biggest object size • m = smallest object size • P = processors Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Empirical Results • Measure runtime on 14-processor Sun • Allocators • Solaris (system allocator) • Ptmalloc (GNU libc) • mtmalloc (Sun’s “MT-hot” allocator) • Micro-benchmarks • Threadtest: no sharing • Larson: sharing (server-style) • Cache-scratch: mostly reads & writes (tests for false sharing) • Real application experience similar Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Runtime Performance: threadtest • Many threads,no sharing • Hoard achieves linear speedup speedup(x,P) = runtime(Solaris allocator, one processor) / runtime(x on P processors) Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Runtime Performance: Larson • Many threads,sharing(server-style) • Hoard achieves linear speedup Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Runtime Performance:false sharing • Many threads,mostly reads & writes of heap data • Hoard achieves linear speedup Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Hoard in the “Real World” • Open source code • www.hoard.org • 13,000 downloads • Solaris, Linux, Windows, IRIX, … • Widely used in industry • AOL, British Telecom, Novell, Philips • Reports: 2x-10x, “impressive” improvement in performance • Search server, telecom billing systems, scene rendering,real-time messaging middleware, text-to-speech engine, telephony, JVM • Scalable general-purpose memory manager Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Custom Memory Allocation • Programmers replace malloc/free • Attempt to increase performance • Provide extra functionality (e.g., for servers) • Reduce space (rarely) • Empirical study of custom allocators: • Lea allocator often as fast or faster • Custom allocation ineffective, except for regions. Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Overview of Regions • Regions: separate areas, deletion only en masse regioncreate(r) r regionmalloc(r, sz) regiondelete(r) • Used in parsers, server applications Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Server Support • Certain servers need additional support • “Process” isolation • Multiple threads, many transactions per thread • Minimize accidental overwrites of unrelated data • Avoid resource leaks • Tear down all memory associated with terminated connections or transactions • Current approach (e.g., Apache): regions Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Fast Pointer-bumping allocation Deletion of chunks Convenient One call frees all memory Space Can’t free objects Drag Can’t use for allallocation patterns Regions: Pros and Cons • Regions: separate areas, deletion only en masse regioncreate(r) r regionmalloc(r, sz) regiondelete(r) Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Regions Are Limited • Can’t reclaim memory in regions unbounded memory consumption • Long-running computations • Producer-consumer patterns • Current situation for Apache: • vulnerable to denial-of-service • limits runtime of connections • limits module programming • Regions: wrong abstraction Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Reap Hybrid Allocator • Reap = region + heap • Adds individual object deletion & heap reapcreate(r) r reapmalloc(r, sz) reapfree(r,p) reapdelete(r) • Can reduce memory consumption • Fast • Adapts to use (region or heap style) • Cheap deletion Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Using Reap as Regions Reap performance nearly matches regions Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Reap In Progress • Incorporate Reap in Apache • Rewrite modules to use Reap • Measure space savings • Simplifies module programming& adds robustness against denial-of-service Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Open Questions • Grand Unified Memory Manager? • Hoard + Reap • Integration with garbage collection • Effective Custom Allocators? • Exploit sizes, lifetimes, locality and sharing • Challenges of newer architectures • SMT/CMP Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Contributions Memory management for high-performance applications • Framework for buildinghigh-quality memory managers (Heap Layers)[Berger, Zorn & McKinley, PLDI-01] • Provably scalable memory manager (Hoard)[Berger, McKinley, Blumofe & Wilson, ASPLOS-IX] • Study of custom memory allocation;Hybrid high-performance memory managerfor server applications (Reap)[Berger, Zorn & McKinley, OOPSLA-2002] Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Backup Slides Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Empirical Results, Runtime Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Empirical Results, Space Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Robust Resource Management • User processes can bring down systems (= DoS) • Current solutions • Kill processes (Linux) • Die (Linux, Solaris, Windows) • Proposed solutions limit utilization • Quotas, proportional shares • Insight: As resources becomes scarce, make them “cost” more (apply economic model) Fork bomb – use all process ids • Malloc all memory – exhausts swap Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Future Work “Performance, scalability, and robustness” • Short-term • Memory management • False sharing • Robust garbage collection for multiprogrammed systems(with McKinley, Blackburn & Stylos) • Locality – self-reorganizing data structures • Compiler-based static error detection[Guyer, Berger & Lin, in preparation] • Longer term • Safety & security as dataflow problems • Integration of OS/runtime/compiler Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Rockall: Larson Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Rockall: Threadtest Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Hoard Conclusions • As fast as uniprocessor allocators • Performance linear in number of processors • Avoids false sharing • Worst-case provably near optimal • Scalable general-purpose memory manager Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger
Conceptually Modular malloc free Memory Management for High-Performance Applications - Ph.D. defense - Emery Berger