Fast Multiprocessor Memory Allocation and Garbage Collection Based on Hans-J. Boehm, HP Laboratories, December 2000 Presented by Yair Sade
About the Author • Works at HP Labs. • Researches high-performance GCs. • The gcj runtime is based on his ideas. • There are also commercial GCs based on his ideas (Geodesic GC). • Etc...
Introduction • Large scale commercial Java applications are becoming more common. • Small scale multi-processor machines are becoming more economical and popular. • Multi-processors are becoming common even on desktop machines.
Introduction – cont. • Garbage collectors need to be highly efficient. • Especially in multi-processor environments. • Otherwise, throughput of multi-threaded applications may drop dramatically.
Introduction – cont. • GC pause time should be minimized. • GC itself should be able to run in parallel.
Motivation • Create a generic scalable GC. • GC should be transparent. • Emphasis on small scale (desktop) multi-processor machines. • Should not degrade performance on single-processor machines.
Motivation – cont. • Throughput should not degrade as more clients and processors are added (a global lock is bad). • GC itself should be scalable. • Even single-threaded applications should benefit somewhat from multiprocessors.
This Work • A new GC that meets the above demands. • Plug-in as malloc/free replacement. • Design based on existing Mark and Sweep GC of T. Endo, K. Taura and A. Yonezawa (ETY97).
Roadmap • Related Work. • Algorithm description. • Parallelism and Performance Issues. • GC-friendly TLS. • Benchmarks. • Summary.
Related Work • Multi-threaded mutators (E. Petrank, 2000). • Thread local storage (B. Steensgaard, 2000). • Taura and Yonezawa's work: • Emphasis on high-end systems. • No single-threaded support, two different libraries. • No malloc/free plug-in. • No thread local storage.
Algorithm Context • Based on Boehm Mark / Lazy Sweep. • The heap is a big bag of pages. • Each page holds objects of a single size. • Each page has a descriptor with mark bits for its allocated objects.
Algorithm Context – cont. • Unmarked objects are reclaimed incrementally during allocation calls. • GC works conservatively, and treats everything as a potential pointer. • Type information (from compiler / programmer) may be used. • In practice, however, conservative GCs work fine.
Algorithm Context – cont. • The GC never copies / moves objects. • Currently generational GC is not supported. • However, the techniques presented here can be used in generational GCs.
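The lazy sweep described in these slides can be sketched for a single page as follows. This is a minimal illustration and not the collector's actual code: the `Page` and `Slot` layouts, the constants, and the `sweep_page` name are all hypothetical, and one mark byte per object is used for simplicity.

```c
#include <stddef.h>

enum { SLOTS_PER_PAGE = 8, SLOT_SIZE = 16 };  /* hypothetical page geometry */

/* A slot is either a live object or a link in a free list. */
typedef union Slot {
    union Slot *next_free;              /* meaningful only while the slot is free */
    unsigned char payload[SLOT_SIZE];   /* meaningful while the slot is allocated */
} Slot;

/* Hypothetical page: a descriptor (mark flags) plus same-sized objects. */
typedef struct {
    unsigned char mark[SLOTS_PER_PAGE];
    Slot slots[SLOTS_PER_PAGE];
} Page;

/* Sweep one page: push every unmarked slot onto free_list and clear all
 * marks for the next collection cycle. Returns the new list head and
 * stores the number of reclaimed slots in *reclaimed. */
static Slot *sweep_page(Page *page, Slot *free_list, size_t *reclaimed) {
    size_t count = 0;
    for (size_t i = 0; i < SLOTS_PER_PAGE; i++) {
        if (!page->mark[i]) {
            page->slots[i].next_free = free_list;
            free_list = &page->slots[i];
            count++;
        }
        page->mark[i] = 0;
    }
    *reclaimed = count;
    return free_list;
}
```

Because objects are never moved, sweeping only threads the dead slots onto a list; the allocator can call such a routine on demand instead of sweeping the whole heap at once.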
Mark Phase • Reachable objects (grey) are pushed onto the mark stack. • Each stack entry contains a base address and a mark descriptor. • The mark descriptor contains the locations of all possible pointers relative to the base address.
Mark Phase – cont. • Initially the roots are pushed onto the stack. • The marker iterates over the stack content. • Empty pages are recycled immediately. • Nearly full pages are ignored. • Remaining pages are enqueued by object size for later sweeping.
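The stack-driven marking loop above can be sketched like this. It is a simplified, single-threaded illustration: the `Obj` layout is hypothetical, and the mark descriptor is reduced to a count of pointer fields rather than a real layout descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_FIELDS = 4, STACK_CAP = 64 };  /* hypothetical limits */

/* Hypothetical object: a mark flag plus up to MAX_FIELDS pointer slots.
 * nfields stands in for the mark descriptor. */
typedef struct Obj {
    bool marked;
    size_t nfields;
    struct Obj *fields[MAX_FIELDS];
} Obj;

typedef struct {
    Obj *items[STACK_CAP];
    size_t top;
} MarkStack;

/* Mark an object grey: set its mark and push it for later scanning. */
static void push_if_unmarked(MarkStack *s, Obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = true;
    assert(s->top < STACK_CAP);  /* a real collector must handle overflow */
    s->items[s->top++] = o;
}

/* Mark everything reachable from the roots; returns the number marked. */
static size_t mark_from_roots(Obj **roots, size_t nroots) {
    MarkStack s = { .top = 0 };
    size_t marked = 0;
    for (size_t i = 0; i < nroots; i++) push_if_unmarked(&s, roots[i]);
    while (s.top > 0) {
        Obj *o = s.items[--s.top];
        marked++;
        for (size_t i = 0; i < o->nfields; i++)
            push_if_unmarked(&s, o->fields[i]);
    }
    return marked;
}
```

Setting the mark before pushing keeps cycles from looping forever and bounds the stack by the number of live objects.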
Allocation • Large objects are allocated at page level. • The allocator maintains free lists for various small object sizes. • When the free list for an object size is empty, enqueued pages are swept. • If there are no such pages, a new page is acquired.
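A minimal sketch of size-class free lists, assuming a 16-byte granule and 4 KB pages (both hypothetical). The step of sweeping a queued page is omitted here; the sketch goes straight to acquiring a fresh page, and `small_alloc` and `carve_page` are illustrative names.

```c
#include <stddef.h>
#include <stdlib.h>

enum { PAGE_SIZE = 4096, GRANULE = 16, NUM_CLASSES = 8 };  /* assumed values */

typedef struct FreeObj { struct FreeObj *next; } FreeObj;

static FreeObj *free_lists[NUM_CLASSES];  /* one free list per size class */

/* Acquire a fresh page and thread it into objects of one size class. */
static FreeObj *carve_page(size_t cls) {
    size_t obj_size = (cls + 1) * GRANULE;
    char *page = malloc(PAGE_SIZE);
    if (page == NULL) return NULL;
    FreeObj *head = NULL;
    for (size_t off = 0; off + obj_size <= PAGE_SIZE; off += obj_size) {
        FreeObj *o = (FreeObj *)(page + off);
        o->next = head;
        head = o;
    }
    return head;
}

/* Allocate a small object by popping the free list of its size class. */
static void *small_alloc(size_t nbytes) {
    if (nbytes == 0 || nbytes > NUM_CLASSES * GRANULE) return NULL;
    size_t cls = (nbytes - 1) / GRANULE;
    if (free_lists[cls] == NULL) {
        /* A real collector would first sweep a queued page of this size. */
        free_lists[cls] = carve_page(cls);
    }
    FreeObj *o = free_lists[cls];
    if (o == NULL) return NULL;
    free_lists[cls] = o->next;
    return o;
}
```

In this sketch a request for 10 bytes and a request for 16 bytes land in the same class and come from the same page, which is the "each page holds objects of a single size" property.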
Parallel Allocation • Some JVMs use per-thread allocation arenas to avoid synchronization on every small allocation request. • Arena size is limited, and an unused arena consumes memory unnecessarily. • Our algorithm uses an improved scheme, using free lists instead of contiguous arenas.
Parallel Allocation – cont. • Each thread has 64 or 48 free list headers for different object sizes (no locks). • Larger objects are allocated from a global free list (lock required). • Since large object allocation is expensive anyway, the synchronization cost is amortized.
Parallel Allocation – cont. • Initially, a thread allocates from the global free list. • After a threshold, it starts allocating from its thread-local free list. • This avoids allocating unnecessary storage.
Parallel Allocation – cont. • Allocating from the thread-local free list does not require synchronization. • Locks are required only for taking a page from the global list or from the waiting-to-be-swept queue.
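The thread-local scheme can be sketched as below. Several things are assumptions made for illustration: `malloc` stands in for the global heap, a C11 `atomic_flag` spinlock stands in for the global lock, and `THRESHOLD` and `BATCH` are made-up values. The counter `global_lock_acquisitions` exists only to make the locking behaviour observable.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

enum { GRANULE = 16, NUM_CLASSES = 48, THRESHOLD = 3, BATCH = 8 };  /* assumed */

typedef struct FreeObj { struct FreeObj *next; } FreeObj;

typedef struct {
    FreeObj *head;           /* thread-local free list for one size class */
    unsigned direct_allocs;  /* allocations served by the global heap so far */
} LocalList;

static _Thread_local LocalList local_lists[NUM_CLASSES];

static atomic_flag global_lock = ATOMIC_FLAG_INIT;
static size_t global_lock_acquisitions;  /* instrumentation only */

static void lock_global(void) {
    while (atomic_flag_test_and_set_explicit(&global_lock, memory_order_acquire)) { }
    global_lock_acquisitions++;
}

static void unlock_global(void) {
    atomic_flag_clear_explicit(&global_lock, memory_order_release);
}

/* Stand-in for the global heap: returns n linked objects of one class. */
static FreeObj *global_take(size_t cls, size_t n) {
    size_t obj_size = (cls + 1) * GRANULE;
    lock_global();
    char *chunk = malloc(n * obj_size);
    unlock_global();
    if (chunk == NULL) return NULL;
    FreeObj *head = NULL;
    for (size_t i = n; i-- > 0; ) {
        FreeObj *o = (FreeObj *)(chunk + i * obj_size);
        o->next = head;
        head = o;
    }
    return head;
}

static void *tl_alloc(size_t nbytes) {
    if (nbytes == 0) return NULL;
    if (nbytes > NUM_CLASSES * GRANULE) {      /* large object: global, locked */
        lock_global();
        void *p = malloc(nbytes);
        unlock_global();
        return p;
    }
    size_t cls = (nbytes - 1) / GRANULE;
    LocalList *l = &local_lists[cls];
    if (l->direct_allocs < THRESHOLD) {        /* below threshold: stay global */
        l->direct_allocs++;
        return global_take(cls, 1);
    }
    if (l->head == NULL) l->head = global_take(cls, BATCH);  /* locked refill */
    FreeObj *o = l->head;
    if (o != NULL) l->head = o->next;          /* lock-free common case */
    return o;
}
```

With these values, the first three allocations of a size each take the lock, and afterwards the lock is taken once per eight allocations. A thread that allocates only a few objects of a size never builds a local list for it, which is the storage-saving point of the threshold.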
Parallel Marking • At process startup, N-1 marker threads are created (N is the number of processors). • When a mutator initiates a GC (on allocation), it stops all other mutators (stop the world) and wakes up the marker threads.
Parallel Marking – cont. • The mark stack discussed before is used as a global list of waiting mark tasks. • In effect, the global mark stack acts as a queue. • Initially the mark stack is filled with the roots.
Parallel Marking – cont. • Each marker thread atomically removes 1-5 items from the global stack and inserts them into its thread-local stack. • The marker thread iteratively marks from its local stack. • When the local stack is emptied, more items are fetched from the global stack.
Parallel Marking – cont. • Objects might be marked twice. • Items are returned to the global stack in the following scenarios: • When a marker thread discovers that the global stack is empty and there are entries in its local stack (for load balancing). • In case of local stack overflow. • These cases require a lock, but occur rarely.
Parallel Marking – Cheats and Tricks. • A shared pointer to the next mark task is maintained with compare-and-swap (to avoid unneeded scans). • Large objects are split before the marking phase to increase parallelism.
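The transfer of work between the global and thread-local mark stacks can be sketched as follows. Marker threads themselves are left out. As a simplification, a spinlock guards every access to the global stack, whereas the slides describe compare-and-swap on a shared pointer. Returning half of the local entries is an assumed policy, and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { STACK_CAP = 64, BATCH = 5 };  /* BATCH follows the "1-5 items" bullet */

typedef struct {
    void *items[STACK_CAP];
    size_t top;
} MarkStack;

static MarkStack global_stack;
static atomic_flag global_stack_lock = ATOMIC_FLAG_INIT;

static void lock_stack(void) {
    while (atomic_flag_test_and_set_explicit(&global_stack_lock, memory_order_acquire)) { }
}

static void unlock_stack(void) {
    atomic_flag_clear_explicit(&global_stack_lock, memory_order_release);
}

/* Move up to BATCH entries from the global stack to a local one. */
static size_t fetch_batch(MarkStack *local) {
    size_t moved = 0;
    lock_stack();
    while (moved < BATCH && global_stack.top > 0 && local->top < STACK_CAP) {
        local->items[local->top++] = global_stack.items[--global_stack.top];
        moved++;
    }
    unlock_stack();
    return moved;
}

/* Load balancing: if the global stack has run dry, give back half of the
 * local entries so that idle markers can find work. */
static size_t share_if_global_empty(MarkStack *local) {
    size_t moved = 0;
    lock_stack();
    if (global_stack.top == 0 && local->top > 1) {
        size_t give = local->top / 2;
        while (moved < give) {
            global_stack.items[global_stack.top++] = local->items[--local->top];
            moved++;
        }
    }
    unlock_stack();
    return moved;
}
```

A marker thread would alternate between draining its local stack and calling these two routines; small batches keep work spread across markers, while the local stack keeps most pushes and pops free of synchronization.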
Mark bit representation • Each page has an array of mark bits. • A problem may arise when two marker threads attempt to update adjacent mark bits concurrently. • We need an atomic way to update a mark bit.
Mark bit representation – cont. • How to avoid updates of the same word by two threads: • By a read followed by compare-and-swap (an expensive operation). • By using a byte instead of a bit. (To compensate for the space loss, we require object sizes to be a multiple of 2 words.) • Mark bytes then take up 1/8 of the heap size, which leads to extra space and more cache misses. • However, this reduces the number of instructions.
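The two alternatives can be contrasted in a short sketch using C11 atomics. The function names are hypothetical, and relaxed memory ordering is an assumption that relies on the markers synchronizing elsewhere before the marks are read.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per object: neighbouring objects share a word, so setting a bit
 * needs an atomic read-modify-write (here a compare-and-swap loop). */
static void set_mark_bit(_Atomic uint32_t *words, size_t index) {
    _Atomic uint32_t *word = &words[index / 32];
    uint32_t bit = (uint32_t)1 << (index % 32);
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    while ((old & bit) == 0 &&
           !atomic_compare_exchange_weak_explicit(word, &old, old | bit,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* a failed CAS refreshed 'old'; loop and retry */
    }
}

/* One byte per object: each object owns its byte, so a plain store is
 * enough and no read-modify-write is needed. */
static void set_mark_byte(_Atomic unsigned char *bytes, size_t index) {
    atomic_store_explicit(&bytes[index], 1, memory_order_relaxed);
}

/* Space cost of mark bytes: one byte per minimum-sized object. */
static size_t mark_bytes_needed(size_t heap_bytes, size_t min_object_bytes) {
    return heap_bytes / min_object_bytes;
}
```

With 4-byte words and a 2-word minimum object size, `mark_bytes_needed(heap, 8)` is `heap / 8`, which matches the 1/8 figure on the slide.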
GC-friendly TLS • We need a way to quickly obtain a pointer to the TLS data. • Most JVMs use a register to hold thread context data. • The operating system provides services for accessing thread local storage.
GC-friendly TLS – cont. • On POSIX: • pthread_key_create. • pthread_setspecific. • pthread_getspecific. • On Windows: • TlsAlloc. • TlsSetValue. • TlsGetValue.
GC-friendly TLS – cont. • Getting a TLS value must be as fast as possible. • It needs to be called on every allocation in order to get the thread's free-list headers. • The performance of pthread_getspecific is inadequate.
GC-friendly TLS – cont. • High performance TLS implementation. • Each TLS key is actually a pointer to a data structure containing two kinds of data: • A hash table mapping thread IDs to the associated TLS. • A hash table for fast lookup (cache).
GC-friendly TLS – cont. • We calculate a quick thread id (qtid) and look it up in the fast lookup cache. • If the entry is not in the cache, we look it up in the main hash table. • Access on a cache hit requires only 4 memory references.
GC-friendly TLS – cont. • Updates of the hash table require locks, but the structure always remains consistent. • Looking up data does not require a lock.
GC-friendly TLS – cont. • This implementation is applicable only in a GC environment. • Upon thread termination, its TLS data is not freed. Freeing the hash table entries is done by the GC. • In a non-GC environment, freeing the hash entries would force us to synchronize reader access.
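The two-level lookup can be sketched as below. Everything here is an illustrative assumption rather than the real implementation: the thread id and qtid are passed in as parameters instead of being derived from the running thread, each thread is assumed to call `tls_set` once, `malloc` stands in for GC allocation, and the `slow_lookups` counter is instrumentation only.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { CACHE_SIZE = 16, TABLE_SIZE = 8 };  /* assumed sizes */

typedef struct Entry {
    _Atomic uintptr_t qtid;  /* quick thread id last used by the owner */
    uintptr_t thread;        /* full thread id */
    void *value;             /* the thread's TLS value */
    struct Entry *next;      /* hash chain; never changes once published */
} Entry;

typedef struct {
    _Atomic(Entry *) cache[CACHE_SIZE];  /* fast lookup, indexed by qtid */
    _Atomic(Entry *) table[TABLE_SIZE];  /* main table, indexed by thread id */
    _Atomic size_t slow_lookups;
} TlsKey;

static atomic_flag update_lock = ATOMIC_FLAG_INIT;

/* Insert under a lock. The entry is fully built before it is published,
 * so lock-free readers never see a half-initialized node. Entries are
 * never freed here; the slides leave that to the GC. */
static int tls_set(TlsKey *key, uintptr_t thread, void *value) {
    Entry *e = malloc(sizeof *e);
    if (e == NULL) return -1;
    atomic_init(&e->qtid, 0);
    e->thread = thread;
    e->value = value;
    while (atomic_flag_test_and_set_explicit(&update_lock, memory_order_acquire)) { }
    e->next = atomic_load_explicit(&key->table[thread % TABLE_SIZE], memory_order_relaxed);
    atomic_store_explicit(&key->table[thread % TABLE_SIZE], e, memory_order_release);
    atomic_flag_clear_explicit(&update_lock, memory_order_release);
    return 0;
}

/* Lookup without any lock: try the cache first, then the main table. */
static void *tls_get(TlsKey *key, uintptr_t thread, uintptr_t qtid) {
    Entry *e = atomic_load_explicit(&key->cache[qtid % CACHE_SIZE], memory_order_acquire);
    if (e != NULL && atomic_load_explicit(&e->qtid, memory_order_relaxed) == qtid)
        return e->value;                                   /* cache hit */
    atomic_fetch_add_explicit(&key->slow_lookups, 1, memory_order_relaxed);
    for (e = atomic_load_explicit(&key->table[thread % TABLE_SIZE], memory_order_acquire);
         e != NULL; e = e->next) {
        if (e->thread == thread) {
            atomic_store_explicit(&e->qtid, qtid, memory_order_relaxed);
            atomic_store_explicit(&key->cache[qtid % CACHE_SIZE], e, memory_order_release);
            return e->value;
        }
    }
    return NULL;
}
```

The sketch relies on two threads never presenting the same qtid at the same time. Because entries are only ever added, a reader that races with an insertion sees either the old chain or the new one, both of which are valid.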
Benchmarking • Intel Pentium Pro 200 on Linux RedHat 7. • Machines with 1-4 processors. • Single 66 MHz bus (which may cause bus contention).
Benchmarking – cont. • RH7. • RH7 Single (thread unsafe). • Hoard. • GC. • GC Single (thread unsafe). • SGC.
Ghostscript • No benefit for GC. • Large allocations (avg. 97 bytes). • Single-threaded.
MT_GCBench2 • Artificial benchmark. • Construction/destruction of binary trees of 16-byte objects. • Good performance of GC.
Larson • Artificial Benchmark. • One thread allocates, another deallocates. • Good performance of Hoard.
Larson Small • Larson test with small objects. • Better performance of GC.
Benchmark Summary • Our GC is good for small allocated objects. • Advantages of GC over parallel malloc/free: • No per-object lock on deallocations (objects are deallocated together). • No per-object lock on allocations (we can reuse memory from thread local storage).
Summary • We presented a scalable GC allocator. • We touched on some GC-related multi-processor performance issues. • We saw algorithms that exploit advantages a GC environment has. • We saw in the benchmarks that it works quite well.
A Few Words about the Hoard Allocator • Hoard allocator by E. Berger, K. McKinley, R. Blumofe and P. Wilson (November 2000). • Not a GC but an allocator. • A malloc/free replacement. • “Competes” with the Boehm GC.
Hoard – cont. • Highly scalable (for high-end machines). • Low fragmentation (blowup). • Avoids false sharing of cache lines. • Based on a global heap and per-processor heaps.