
Fast Multiprocessor Memory Allocation and Garbage Collection


Presentation Transcript


  1. Fast Multiprocessor Memory Allocation and Garbage Collection. Based on Hans-J. Boehm, HP Laboratories, December 2000. Presented by Yair Sade.

  2. About the Author • Works at HP Labs. • Research on high-performance GCs. • The gcj runtime is based on his ideas. • There are also commercial GCs based on his ideas (Geodesic GC). • Etc...

  3. Introduction • Large-scale commercial Java applications are becoming more common. • Small-scale multi-processor machines are becoming more economical and popular. • Multi-processors are becoming common even on desktop machines.

  4. Introduction – cont. • Garbage collectors need to be highly efficient, especially in multi-processor environments. • Otherwise, the throughput of multi-threaded applications may drop dramatically.

  5. Introduction – cont. • GC pause time should be minimized. • GC itself should be able to run in parallel.

  6. Motivation • Create a generic, scalable GC. • The GC should be transparent. • Emphasis on small-scale (desktop) multi-processor machines. • Should not degrade performance on single-processor machines.

  7. Motivation – cont. • Throughput should not degrade as more clients and processors are added (a global lock is bad). • The GC itself should be scalable. • Even single-threaded applications should benefit somewhat from multiprocessors.

  8. This Work • A new GC that meets the above demands. • Plugs in as a malloc/free replacement. • Design based on the existing mark-and-sweep GC of T. Endo, K. Taura and A. Yonezawa (ETY97).

  9. Roadmap • Related Work. • Algorithm description. • Parallelism and Performance Issues. • GC-friendly TLS. • Benchmarks. • Summary.

  10. Related Work • Multi-threaded mutators (E. Petrank, 2000). • Thread-local storage (B. Steensgaard, 2000). • Endo, Taura and Yonezawa's work (ETY97): • Emphasis on high-end systems. • No single-threaded support; two different libraries. • No malloc/free plug-in. • No thread-local storage.

  11. Algorithm Context • Based on Boehm's mark / lazy-sweep collector. • The heap is treated as a big bag of pages. • Each page holds objects of a single size. • Each page has a descriptor with mark bits for its allocated objects.
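To make the layout concrete, a page descriptor might look roughly like the sketch below. This is an illustration with hypothetical names and sizes, not the collector's actual declarations.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_BYTES    4096   /* hypothetical page size                         */
    #define MIN_OBJ_BYTES 16     /* hypothetical smallest object granule (2 words) */

    /* Per-page descriptor: all objects on a page have the same size, and the
     * descriptor carries one mark bit per object slot on that page. */
    struct page_desc {
        size_t obj_bytes;        /* size of every object on this page              */
        size_t obj_count;        /* number of object slots on the page             */
        uint8_t mark_bits[PAGE_BYTES / MIN_OBJ_BYTES / 8];   /* packed mark bits   */
        struct page_desc *next;  /* link used by the per-size page queues          */
    };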

  12. Algorithm Context – cont. • Unmarked objects are reclaimed incrementally during allocation calls. • The GC works conservatively and treats everything as a potential pointer. • Type information (from the compiler / programmer) may be used. • But practice shows that conservative GCs work fine.

  13. Algorithm Context – cont. • The GC never copies / moves objects. • Currently, generational GC is not supported. • However, the techniques described here can be used in generational GCs.

  14. Mark Phase • Reachable objects (grey) are pushed onto the stack. • Each stack entry contains a base address and a mark descriptor. • The mark descriptor gives the locations of all possible pointers relative to the base address.
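A minimal sketch of such a stack entry, with hypothetical field names (in the fully conservative case the descriptor is simply the object's length, and every word inside the object is treated as a possible pointer):

    #include <stdint.h>

    /* One pending mark task: the base address of a grey object plus a
     * descriptor that tells the marker where potential pointers may live
     * inside it (e.g. a length, or a compact pointer-location bitmap). */
    struct mark_entry {
        void      *base;        /* object base address          */
        uintptr_t  descriptor;  /* pointer-location information */
    };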

  15. Mark Phase – cont. • Initially, the roots are pushed onto the stack. • The marker then iterates over the stack's contents. • Empty pages are recycled immediately. • Nearly full pages are ignored. • Remaining pages are enqueued by object size for later sweeping.

  16. Allocation • Large objects are allocated at the page level. • The allocator maintains free lists for the various small object sizes. • When the free list for an object size is empty, enqueued pages of that size are swept. • If there are no such pages, a new page is acquired.
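Roughly, the small-object allocation path then behaves like the sketch below. Helper names such as sweep_next_page and acquire_new_page are hypothetical, and error handling is omitted.

    #include <stddef.h>

    #define NUM_SIZE_CLASSES 64

    extern void *free_list[NUM_SIZE_CLASSES];  /* one free list per small object size     */
    extern int   sweep_next_page(size_t sz);   /* sweep one enqueued page; 0 if none left */
    extern void  acquire_new_page(size_t sz);  /* carve a fresh page into size-sz objects */

    /* Allocate one small object of size class sz. */
    void *small_alloc(size_t sz)
    {
        while (free_list[sz] == NULL) {
            if (!sweep_next_page(sz))   /* lazy sweep: reclaim unmarked objects of this size */
                acquire_new_page(sz);   /* no swept page helped: take a new page             */
        }
        void *p = free_list[sz];
        free_list[sz] = *(void **)p;    /* pop the first slot off the free list              */
        return p;
    }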

  17. Parallel Allocation • Some JVMs use per-thread allocation arenas to avoid synchronization on every small allocation request. • The arena size is limited, and unused arena space leads to unnecessary memory consumption. • Our algorithm uses an improved scheme, using free lists instead of contiguous arenas.

  18. Parallel Allocation – cont. • Each thread has 64 or 48 free-list headers for the different object sizes (no locks needed). • Larger objects are allocated from a global free list (lock required). • Since large-object allocation is expensive anyway, the synchronization cost is amortized.

  19. Parallel Allocation – cont. • Initially, a thread allocates from the global free list. • After a threshold, it starts allocating from its thread-local free lists. • This avoids setting aside unnecessary storage for threads that allocate little.

  20. Parallel Allocation – cont. • Allocating from the thread-local free list does not require synchronization. • Locks are required only when taking a page from the global list or from the waiting-to-be-swept queue.
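Combining slides 17-20, the fast path might look roughly as follows. Names are hypothetical: get_tls stands for the fast TLS lookup described later, refill_from_global is the only place a lock is taken, and the sketch assumes the refill always succeeds or aborts on out-of-memory.

    #include <stddef.h>

    #define NUM_SIZE_CLASSES 64

    struct thread_cache {
        void *free_list[NUM_SIZE_CLASSES];   /* per-thread free-list headers, no locks */
    };

    extern struct thread_cache *get_tls(void);                          /* fast TLS lookup  */
    extern void refill_from_global(struct thread_cache *tc, size_t sz); /* locked slow path */

    void *gc_malloc_small(size_t sz)
    {
        struct thread_cache *tc = get_tls();
        if (tc->free_list[sz] == NULL)     /* slow path: sweep a queued page or take a  */
            refill_from_global(tc, sz);    /* page from the global list, under the lock */
        void *p = tc->free_list[sz];
        tc->free_list[sz] = *(void **)p;   /* lock-free pop from the thread-local list  */
        return p;
    }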

  21. Parallel Marking • At process startup, N-1 marker threads are created (for N processors). • When a mutator initiates a GC (during an allocation), it stops all other mutators (stop-the-world) and wakes up the marker threads.

  22. Parallel Marking – cont. • The mark stack we discussed before is used as a global list of waiting mark tasks. • The global mark stack actually acts as a queue. • Initially, the mark stack is filled with the roots.

  23. Parallel Marking – cont. • Each marker thread atomically removes 1-5 items from the global stack and inserts them into its thread-local stack. • The marker thread iteratively marks from its local stack. • When the local stack is empty, more items are fetched from the global stack.
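Schematically, a marker thread's loop might look like the sketch below. Helper names are hypothetical, and the return-to-global cases described on slide 25 (load balancing, local overflow) are hidden inside trace here.

    #include <stddef.h>
    #include <stdint.h>

    struct mark_entry { void *base; uintptr_t descriptor; };

    #define LOCAL_CAP 1024

    /* Atomically claim up to `max` entries from the global mark stack; returns how many. */
    extern size_t grab_from_global(struct mark_entry *out, size_t max);

    /* Mark the object described by `e`, push its unmarked children onto the local stack,
     * and return the new stack top; spills to the global stack on local overflow. */
    extern size_t trace(struct mark_entry e, struct mark_entry *stack, size_t top, size_t cap);

    void marker_loop(void)
    {
        struct mark_entry local[LOCAL_CAP];
        size_t top;
        while ((top = grab_from_global(local, 5)) > 0) {  /* fetch 1-5 waiting mark tasks     */
            while (top > 0) {                             /* drain the local stack, lock-free */
                struct mark_entry e = local[--top];
                top = trace(e, local, top, LOCAL_CAP);
            }
        }
    }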

  24. Parallel Marking – cont.

  25. Parallel Marking – cont. • Objects might be marked twice. • Items are returned to the global stack in the following scenarios: • When a marker thread discovers that the global stack is empty while there are still entries in its local stack (for load balancing). • In case of local stack overflow. • These cases require a lock, but they occur rarely.

  26. Parallel Marking – Cheats and Tricks • A shared pointer to the next mark task is maintained with compare-and-swap (to avoid unneeded scans). • Large objects are split before the marking phase to increase parallelism.
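The next-task pointer trick, sketched with C11 atomics (the collector itself uses its own platform-specific compare-and-swap primitive; names here are hypothetical):

    #include <stdatomic.h>
    #include <stdint.h>

    struct mark_entry { void *base; uintptr_t descriptor; };

    /* Shared cursor into the global mark stack: the next entry any marker may claim. */
    static _Atomic(struct mark_entry *) next_task;

    /* Try to claim n consecutive entries below `limit`. Returns the first claimed
     * entry on success, or NULL if the stack is exhausted or another thread won. */
    struct mark_entry *claim_tasks(struct mark_entry *limit, int n)
    {
        struct mark_entry *cur = atomic_load(&next_task);
        if (cur + n > limit)
            return NULL;                    /* nothing left to claim            */
        if (atomic_compare_exchange_strong(&next_task, &cur, cur + n))
            return cur;                     /* entries cur .. cur+n-1 are ours  */
        return NULL;                        /* lost the race; caller retries    */
    }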

  27. Mark Bit Representation • Each page has an array of mark bits. • A problem may arise when two marker threads attempt to update adjacent mark bits in the same word concurrently. • We need an atomic way to update a mark bit.

  28. Mark Bit Representation – cont. • How to avoid two threads updating the same word: • Using read followed by compare-and-swap (an expensive operation). • Using a byte instead of a bit (to compensate for the space loss, object sizes are required to be a multiple of 2 words). • The mark bytes then occupy 1/8 of the heap, which costs extra space and causes more cache misses. • However, this reduces the number of instructions.
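The two alternatives, roughly, using C11 atomics for illustration (the collector uses its own primitives):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Packed mark bits: a plain OR could lose a concurrent update to a
     * neighbouring bit in the same word, so a compare-and-swap loop is needed. */
    void set_mark_bit(_Atomic uint32_t *word, unsigned bit)
    {
        uint32_t old = atomic_load(word);
        while (!atomic_compare_exchange_weak(word, &old, old | (1u << bit)))
            ;   /* retry until our bit is set without clobbering a neighbour */
    }

    /* One mark byte per (2-word) object granule: a plain byte store cannot
     * disturb a neighbouring byte, so no CAS and fewer instructions are needed,
     * at the price of a mark table that is 1/8 of the heap size. */
    void set_mark_byte(volatile uint8_t *byte)
    {
        *byte = 1;
    }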

  29. GC-friendly TLS • We need a way to quickly obtain a pointer to the TLS data. • Most JVMs use a register to hold the thread context data. • The operating system provides services for accessing thread-local storage.

  30. GC-friendly TLS – cont. • On POSIX: • pthread_key_create. • pthread_setspecific. • pthread_getspecific. • On Windows: • TlsAlloc. • TlsSetValue. • TlsGetValue.
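For reference, the POSIX calls are used as shown below; this is the standard pthreads API, included only to make the per-allocation cost concrete (wrapper names are hypothetical):

    #include <pthread.h>

    static pthread_key_t tls_key;

    void tls_init(void)                      /* once, at process startup      */
    {
        pthread_key_create(&tls_key, NULL);  /* NULL: no destructor callback  */
    }

    void tls_attach(void *thread_data)       /* once per thread               */
    {
        pthread_setspecific(tls_key, thread_data);
    }

    void *tls_lookup(void)                   /* on every allocation: hot path */
    {
        return pthread_getspecific(tls_key);
    }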

  31. GC-friendly TLS – cont. • Getting the TLS value must be as fast as possible. • It needs to be called on every allocation in order to reach the thread's free-list headers. • The performance of pthread_getspecific is inadequate.

  32. GC-friendly TLS – cont. • A high-performance TLS implementation. • Each TLS key is actually a pointer to a data structure containing two kinds of data: • A hash table mapping thread ids to the associated TLS values. • A smaller hash table used as a fast lookup cache.

  33. GC-friendly TLS – cont. • We compute a quick thread id (qtid) and look it up in the fast lookup cache. • If the entry is not in the cache, we look it up in the main hash table. • Access on a cache hit requires only 4 memory references.
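A rough sketch of that lookup is shown below. Field and helper names are hypothetical; the quick thread id here is derived from the address of a stack variable, one cheap way to get a value that is unique to the running thread.

    #include <stddef.h>
    #include <stdint.h>

    #define TLS_CACHE_SIZE 256                   /* hypothetical; a power of two */

    struct tls_entry { uintptr_t qtid; void *value; struct tls_entry *next; };

    struct tls_key {
        struct tls_entry *cache[TLS_CACHE_SIZE]; /* fast lookup cache, read without locks */
        struct tls_entry *table[TLS_CACHE_SIZE]; /* main hash table: thread -> TLS value  */
    };

    extern void *slow_lookup(struct tls_key *k, uintptr_t qtid);   /* main-table path */

    void *fast_getspecific(struct tls_key *k)
    {
        int marker;
        uintptr_t qtid = (uintptr_t)&marker >> 12;         /* stack page identifies thread  */
        struct tls_entry *e = k->cache[qtid & (TLS_CACHE_SIZE - 1)];
        if (e != NULL && e->qtid == qtid)                  /* cache hit: a handful of loads */
            return e->value;
        return slow_lookup(k, qtid);                       /* miss: consult the main table  */
    }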

  34. GC-friendly TLS – cont. • Updates of the hash table require locks; however, the structure always remains consistent. • Lookups do not require a lock.

  35. GC-friendly TLS – cont.

  36. GC-friendly TLS – cont. • This implementation is applicable only in a GC environment. • Upon thread termination, its TLS data is not freed immediately; the hash entries are freed by the GC. • In a non-GC environment, freeing the hash entries would force us to synchronize reader access.

  37. Benchmarking • Intel Pentium Pro 200 on Red Hat Linux 7. • Machines with 1-4 processors. • A single 66 MHz bus (which may cause bus contention).

  38. Benchmarking – cont. • Configurations compared: • RH7. • RH7 Single (thread-unsafe). • Hoard. • GC. • GC Single (thread-unsafe). • SGC.

  39. Ghostscript • No benefit for GC. • Large allocations (avg. 97 bytes). • Single-threaded.

  40. MT_GCBench2 • Artificial benchmark. • Construction/destruction of binary trees of 16-byte objects. • Good performance for the GC.

  41. Larson • Artificial Benchmark. • One thread allocates, another deallocates. • Good performance of Hoard.

  42. Larson Small • Larson test with small objects. • Better performance of GC.

  43. Benchmark Summary • Our GC is good for small allocated objects. • Advantages of the GC over parallel malloc/free: • No per-object lock on deallocations (objects are deallocated together). • No per-object lock on allocations (we can reuse memory from the thread-local free lists).

  44. Summary • We presented a scalable GC allocator. • We touched on some GC-related multi-processor performance issues. • We saw algorithms that exploit the advantages a GC environment has. • We saw in the benchmarks that it works quite well.

  45. A Few Words about the Hoard Allocator • The Hoard allocator, by E. Berger, K. McKinley, R. Blumofe and P. Wilson (November 2000). • Not a GC but an allocator. • A malloc/free replacement. • Competes with the Boehm GC.

  46. Hoard – cont. • Highly scalable (for high-end machines). • Low fragmentation (avoids blowup). • Avoids false cache-line sharing. • Based on a global heap plus per-processor heaps.

  47. Questions?

  48. The end!
