Fast Multiprocessor Memory Allocation and Garbage Collection Based on Hans-J. Boehm, HP Laboratories, December 2000 Presented by Yair Sade
About the Author • Works at HP Labs. • Researches high-performance GCs. • The gcj runtime is based on his ideas. • There are also commercial GCs based on his ideas (Geodesic GC).
Introduction • Large-scale commercial Java applications are becoming more common. • Small-scale multi-processor machines are becoming more economical and popular. • Multi-processors are becoming common even on desktop machines.
Introduction – cont. • Garbage collectors need to be highly efficient, especially in multi-processor environments. • Otherwise, the throughput of multi-threaded applications may drop dramatically.
Introduction – cont. • GC pause time should be minimized. • GC itself should be able to run in parallel.
Motivation • Create a generic, scalable GC. • The GC should be transparent. • Emphasis on small-scale (desktop) multi-processor machines. • Should not degrade performance on single-processor machines.
Motivation – cont. • Throughput should not degrade as more clients and processors are added (a global lock is bad). • The GC itself should be scalable. • Even single-threaded applications should benefit somewhat from multiprocessors.
This Work • A new GC that meets the above demands. • Plugs in as a malloc/free replacement. • Design based on the existing mark-and-sweep GC of T. Endo, K. Taura and A. Yonezawa (ETY97).
Roadmap • Related Work. • Algorithm description. • Parallelism and Performance Issues. • GC-friendly TLS. • Benchmarks. • Summary.
Related Work • Multi-threaded mutators (E. Petrank, 2000). • Thread-local storage (B. Steensgaard, 2000). • Taura and Yonezawa's work: • Emphasis on high-end systems. • No single-threaded support; two different libraries. • No malloc/free plug-in. • No thread-local storage.
Algorithm Context • Based on Boehm's mark / lazy-sweep collector. • The heap is organized as a big bag of pages. • Each page holds objects of a single size. • Each page has a descriptor with mark bits for its allocated objects.
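As a rough illustration, a big-bag-of-pages page descriptor might look like the C sketch below. The names, sizes, and field layout are hypothetical, not the collector's actual structures.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096                      /* illustrative page size         */
#define MIN_OBJ   (2 * sizeof(void *))      /* smallest object size (granule) */

/* Hypothetical descriptor for one page: every object on the page has
 * the same size, so mark bits can be indexed by object slot. */
struct page_descriptor {
    size_t  object_size;                         /* size of every object here */
    uint8_t mark_bits[PAGE_SIZE / MIN_OBJ / 8];  /* 1 bit per possible slot   */
    struct page_descriptor *next;                /* link in per-size queues   */
};
```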
Algorithm Context – cont. • Unmarked objects are reclaimed incrementally during allocation calls. • The GC works conservatively, treating every word as a potential pointer. • Type information (from the compiler / programmer) may be used. • But practice shows that conservative GCs work fine.
Algorithm Context – cont. • The GC never copies / moves objects. • Generational collection is currently not supported. • However, the techniques mentioned here can be used in generational GCs.
Mark Phase • Reachable objects (grey) are pushed onto the mark stack. • Each stack entry contains a base address and a mark descriptor. • The mark descriptor gives the locations of all possible pointers relative to the base address.
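In its simplest form, a stack entry can be pictured as the following pair; this is a sketch with assumed names, encoding the descriptor as a bitmap over the object's words.

```c
#include <stdint.h>

/* Illustrative mark-stack entry: where the grey object starts, plus a
 * descriptor saying which of its words may hold pointers. A simple
 * encoding is a bitmap: bit i set means "word i may be a pointer". */
struct mark_entry {
    void      *base;        /* base address of the grey object   */
    uintptr_t  descriptor;  /* bitmap of potential pointer words */
};
```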
Mark Phase – cont. • Initially the roots are pushed onto the stack. • The markers iterate over the stack contents. • After marking, empty pages are recycled immediately. • Nearly full pages are ignored. • The remaining pages are enqueued by object size for later sweeping.
Allocation • Large objects are allocated at the page level. • The allocator maintains free lists for the various small object sizes. • When the free list for an object size is empty, enqueued pages of that size are swept. • If there are no such pages, a new page is acquired (see the sketch below).
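A minimal single-threaded sketch of this path, assuming hypothetical helpers dequeue_page(), sweep_page(), and carve_new_page():

```c
#include <stddef.h>

#define NUM_SIZE_CLASSES 64                 /* illustrative */

struct page_descriptor;                            /* from the page sketch above */
struct page_descriptor *dequeue_page(size_t sc);   /* hypothetical helpers:      */
void *sweep_page(struct page_descriptor *pg);      /* reclaim unmarked objects   */
void *carve_new_page(size_t sc);                   /* split up a fresh page      */

/* Per-size-class free lists; each free object stores the pointer to the
 * next free object in its first word. */
static void *free_lists[NUM_SIZE_CLASSES];

void *alloc_small(size_t sc)
{
    for (;;) {
        void *obj = free_lists[sc];
        if (obj != NULL) {                   /* fast path: pop the list head */
            free_lists[sc] = *(void **)obj;
            return obj;
        }
        /* Free list empty: lazily sweep a page queued for this size. */
        struct page_descriptor *pg = dequeue_page(sc);
        if (pg != NULL) {
            free_lists[sc] = sweep_page(pg);
            continue;
        }
        /* No queued pages left: acquire and carve up a fresh page. */
        free_lists[sc] = carve_new_page(sc);
        if (free_lists[sc] == NULL)
            return NULL;   /* heap exhausted (a real GC would collect here) */
    }
}
```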
Parallel Allocation • Some JVMs use per-thread allocation arenas to avoid synchronization on every small allocation request. • The size of an arena is limited, and an unused arena leads to unnecessary memory consumption. • Our algorithm uses an improved scheme: free lists instead of contiguous arenas.
Parallel Allocation – cont. • Each thread has 64 or 48 free-list headers for the different object sizes (no locks needed). • Larger objects are allocated from a global free list (lock required). • Since large-object allocation is expensive anyway, the synchronization cost is amortized.
Parallel Allocation – cont. • Initially, a thread allocates from the global free list. • After a threshold, it starts allocating from its thread-local free lists. • This avoids allocating unnecessary storage for threads that allocate little.
Parallel Allocation – cont. • Allocating from a thread-local free list requires no synchronization. • Locks are required only for taking a page from the global list or from the waiting-to-be-swept queue, as in the sketch below.
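Putting the pieces together, the multi-threaded fast path might look like this sketch; refill_from_global() is a hypothetical helper standing in for "take a page from the global list or the sweep queue and return a chain of free objects".

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_SIZE_CLASSES 64                 /* illustrative */

void *refill_from_global(size_t sc);        /* hypothetical helper */

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread state: one free-list header per size class. */
struct thread_lists {
    void *free_list[NUM_SIZE_CLASSES];
};

void *alloc_small_mt(struct thread_lists *tl, size_t sc)
{
    void *obj = tl->free_list[sc];
    if (obj != NULL) {                      /* lock-free: list is thread-private */
        tl->free_list[sc] = *(void **)obj;
        return obj;
    }
    /* Slow path: refill from the global structures under a lock. */
    pthread_mutex_lock(&global_lock);
    obj = refill_from_global(sc);
    pthread_mutex_unlock(&global_lock);
    if (obj != NULL)
        tl->free_list[sc] = *(void **)obj;  /* keep the rest of the chain locally */
    return obj;
}
```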
Parallel Marking • At process startup, N-1 marker threads are created. • When a mutator initiates a GC (on allocation), it stops all other mutators (stop the world) and wakes up the marker threads.
Parallel Marking – cont. • The mark stack we discussed before is used as a global list of waiting mark tasks. • The global mark stack actually acts as a queue. • Initially the mark stack is filled with the roots.
Parallel Marking – cont. • Each marker thread atomically removes 1-5 items from the global stack and inserts them into its thread-local stack. • The marker thread iteratively marks from its local stack. • When the local stack empties, more items are fetched from the global stack.
Parallel Marking – cont. • Objects might be marked twice. • Items are returned to the global stack in the following scenarios: • When a marker thread discovers that the global stack is empty while there are entries in its local stack (for load balancing). • On local stack overflow. • These occasions require a lock, but occur rarely.
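A marker thread's loop, sketched with assumed names; only the global-stack operations (stealing work, and spilling entries back) need synchronization.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCAL_STACK_SIZE 256                 /* illustrative */

struct mark_entry { void *base; uintptr_t descriptor; };  /* as sketched earlier */

/* Hypothetical helpers: both global-stack operations take the lock. */
size_t steal_from_global(struct mark_entry *dst, size_t max);
size_t scan_and_push(struct mark_entry e, struct mark_entry *local,
                     size_t top, size_t cap);

void marker_thread_loop(void)
{
    struct mark_entry local[LOCAL_STACK_SIZE];

    for (;;) {
        /* Atomically grab 1-5 grey entries from the global stack. */
        size_t top = steal_from_global(local, 5);
        if (top == 0)
            break;   /* simplification: real markers coordinate termination */

        while (top > 0) {
            struct mark_entry e = local[--top];
            /* Scan e for pointers; newly marked children are pushed onto
             * the local stack. On overflow, or when the global stack is
             * seen empty while local work remains, entries are spilled
             * back to the global stack (rare, and the only locked case). */
            top = scan_and_push(e, local, top, LOCAL_STACK_SIZE);
        }
    }
}
```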
Parallel Marking – Cheats and Tricks • A shared pointer to the next mark task is maintained with compare-and-swap (to avoid unneeded scans). • Large objects are split before the marking phase to increase parallelism.
Mark bit representation • Each page has an array of mark bits. • A problem may arise when two marker threads attempt to update adjacent mark bits concurrently. • We need an atomic way to update a mark bit.
Mark bit representation – cont. • How to avoid two threads updating the same word: • By read and compare-and-swap (an expensive operation). • By using a byte instead of a bit (to compensate for the space loss, we require object sizes to be multiples of 2 words). • Mark bytes then occupy up to 1/8 of the heap size, costing extra space and more cache misses. • However, this reduces the number of instructions per mark.
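In modern compiler terms (the paper predates C11 atomics), the two options look roughly as follows; the __atomic builtins are GCC/Clang extensions, used here purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Option 1: packed mark bits, updated by compare-and-swap so that a
 * concurrent update to an adjacent bit of the same word is not lost. */
static void set_mark_bit_cas(uintptr_t *word, unsigned bit)
{
    uintptr_t mask = (uintptr_t)1 << bit;
    uintptr_t old  = __atomic_load_n(word, __ATOMIC_RELAXED);
    while (!(old & mask)) {
        /* On failure, 'old' is reloaded and the loop retries. */
        if (__atomic_compare_exchange_n(word, &old, old | mask, false,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
            break;
    }
}

/* Option 2: one mark *byte* per object. A byte store cannot clobber a
 * neighbor's mark, so a plain store suffices; the price is up to 1/8 of
 * the heap in mark bytes (hence objects sized in 2-word multiples). */
static void set_mark_byte(uint8_t *mark_bytes, size_t obj_index)
{
    mark_bytes[obj_index] = 1;      /* idempotent; racing stores are fine */
}
```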
GC-friendly TLS • We need a way to quickly obtain a pointer to the TLS data. • Most JVMs use a register for holding thread-context data. • The operating system provides services for accessing thread-local storage.
GC-friendly TLS – cont. • On POSIX: • pthread_key_create. • pthread_setspecific. • pthread_getspecific. • On Windows: • TlsAlloc. • TlsSetValue. • TlsGetValue.
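For reference, the POSIX API is used roughly as follows; the custom scheme described next exists because the pthread_getspecific call sits on every allocation's fast path.

```c
#include <pthread.h>

static pthread_key_t tls_key;

void tls_init(void)                      /* once, at startup */
{
    pthread_key_create(&tls_key, NULL);  /* NULL: no destructor */
}

void tls_attach(void *thread_lists)      /* once per thread */
{
    pthread_setspecific(tls_key, thread_lists);
}

void *tls_lookup(void)                   /* on every allocation */
{
    return pthread_getspecific(tls_key);
}
```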
GC-friendly TLS – cont. • Getting the TLS value must be as fast as possible. • It is called on every allocation in order to reach the thread's free-list headers. • pthread_getspecific's performance is inadequate.
GC-friendly TLS – cont. • A high-performance TLS implementation. • Each TLS key is actually a pointer to a data structure containing two kinds of data: • A hash table mapping thread ids to their associated TLS values. • A hash table for fast lookup (a cache).
GC-friendly TLS – cont. • We compute a quick thread id (qtid) and look it up in the fast lookup cache. • If the entry is not in the cache, we look it up in the main hash table. • Access on a cache hit requires only 4 memory references.
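A sketch of the lookup, with assumed structure names and table sizes. The qtid is derived cheaply from the stack pointer (here, the address of a local variable), which stays stable for a thread as long as its stack remains within one aligned block.

```c
#include <stdint.h>

#define CACHE_SIZE 16      /* small direct-mapped cache; illustrative */
#define HASH_SIZE  64      /* main-table buckets; illustrative        */
#define QTID_SHIFT 12      /* discard low stack-address bits          */

struct tls_entry { uintptr_t qtid; void *value; struct tls_entry *next; };

struct tls_key_data {
    struct tls_entry *cache[CACHE_SIZE];   /* fast lookup cache           */
    struct tls_entry *hash[HASH_SIZE];     /* maps thread ids to TLS data */
};

/* Slow path: search the main hash table and refill the cache. */
void *tls_get_slow(struct tls_key_data *k, uintptr_t qtid);

void *tls_get(struct tls_key_data *k)
{
    uintptr_t qtid = (uintptr_t)&qtid >> QTID_SHIFT;   /* quick thread id */
    struct tls_entry *e = k->cache[qtid % CACHE_SIZE];
    if (e != NULL && e->qtid == qtid)      /* cache hit: ~4 memory references */
        return e->value;
    return tls_get_slow(k, qtid);          /* miss: consult the main table */
}
```

Because entries are only added while threads live and are reclaimed by the GC, a reader either sees a valid entry or falls through to the slow path, which is why lookups need no lock.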
GC-friendly TLS – cont. • Updates of the hash table require locks; however, the structure always remains consistent. • Lookups do not require a lock.
GC-friendly TLS – cont. • This implementation is applicable only in a GC environment. • Upon thread termination, its TLS data is not freed; the hash-table entries are reclaimed by the GC. • In a non-GC environment, freeing the hash entries would force us to synchronize reader access.
Benchmarking • Intel Pentium Pro 200 on Linux Red Hat 7. • Machines with 1-4 processors. • A single 66MHz bus (which may cause bus contention).
Benchmarking – cont. • Allocators compared: • RH7 (the Red Hat 7 malloc). • RH7 Single (thread-unsafe). • Hoard. • GC. • GC Single (thread-unsafe). • SGC.
Ghostscript • No benefit for the GC. • Relatively large allocations (avg. 97 bytes). • Single-threaded.
MT_GCBench2 • An artificial benchmark. • Construction/destruction of binary trees of 16-byte objects. • The GC performs well.
Larson • An artificial benchmark. • One thread allocates, another deallocates. • Hoard performs well.
Larson Small • The Larson test with small objects. • The GC performs better.
Benchmark Summary • Our GC is good for small allocated objects. • Advantages of GC over parallel malloc/free: • No per-object lock on deallocations (objects are deallocated together). • No per-object lock on allocations (we can reuse memory from thread-local storage).
Summary • We presented a scalable GC allocator. • We touched on some GC-related multi-processor performance issues. • We saw algorithms that exploit the advantages a GC environment offers. • We saw in the benchmarks that it works quite well.
Few Words about the Hoard Allocator • The Hoard allocator, by E. Berger, K. McKinley, R. Blumofe and P. Wilson (November 2000). • Not a GC but an allocator. • A malloc/free replacement. • "Competes" with the Boehm GC.
Hoard – cont. • Highly scalable (aimed at high-end machines). • Low fragmentation (avoids blowup). • Avoids false cache-line sharing. • Based on a global heap plus per-processor heaps.