Real-Time Garbage Collection Methods and Techniques

Parallel and ConcurrentReal-time Garbage CollectionPart II: Memory Allocation and Sweeping David F. Bacon T.J. Watson Research Center

Page Data Synchronization, Take 1 16 64 256 free

Page Data, Take 2 16 64 256 Thread 1 16 64 256 Thread 2 16 64 256 free

free 16 free 64 free 256 temp full full 256 16 64 ? free

Compare-and-Swap* boolean CAS(int* addr, int old, int new) { atomic { if (* addr == old) { *addr = new; return true; } else { return false; } } } • IBM 370 since 1970, Intel 80486 since 1989 • Load-Reserved: ARM 6 since 1991, IBM POWER2 since 1993

top Non-locking* Stack push(Page page) { do { Page t = top; page->next = t; } while (! CAS(&top, t, page); } Page pop() { Page t = null; do { t = top; if (t == null) return; Page n = t->next; } while (! CAS(&top, t, n); return t; }

A B C A top top top top Thread 2 a=pop() b=pop() Thread 1 x=pop() start C B n n b A t a t 1 2 C C B n A Thread 1 pop() finish t B b Thread 2 push(A) 3 4

Correct Non-locking Stack push(PageId page) { do { <PageId t, short v> = top; page->next = t; } while (! CAS(&top, <t,v>, <page,(v+1) % 0xffff>); } PageId pop() { PageId t = 0; do { <PageId t, short version> = top; if (t==0) return 0; } while (! CAS(&top, <t,v>, <next(t),(v+1) % 0xffff>); return t; }

Really Correct Non-locking Stack push(Page page) { do { Page t = LOAD_AND_RESERVE(&top); page->next = t; } while (! STORE_CONDITIONAL(&top, page)); } Page pop() { Page t = null; do { Page t = LOAD_AND_RESERVE(&top); if (t==null) return null; } while (! STORE_CONDITIONAL(&top, t->next)); return t; }

Coalescing ? free • Ideal: • getMultiplePages(n) returns best fit • Does not cause mutator activities to block • Does not cause collection activities to block • Does not suffer pathology under contention • Reality: tradeoffs between • Quality of fit • Probability of blocking • Overhead (number of CAS’s)

Cartesian Tree + Try-lock ? 44 45 46 47 33 34 35 50 51 21 22 62 63 11 24 59 25 x.left.size < x.size >= x.right.size x.left.address < x.address < x.right.address

Non-locking Rebalancing Tree • Homework… • …but I won’t use it. • Solution would be • Too complex to be reliable • Require too many CAS operations to be fast

The Fix: Transactional Memory AtomicCartesianTree extends CartesianTree { atomic Block getBlock() { return super.getBlock(); } atomic void releaseBlock(Block b) { super.releaseBlock(b); } atomic Block[] getMultipleBlocks(int n) { return super.getMultipleBlocks(n); } }

Hierarchical Non-locking Buddy? 21 22 24 25 11 33 34 35 50 51 44 45 46 47 59 62 63

So much for allocation…Garbage Collection

? p Z T Y X W U r Y X a a a a a a a a b b b b b b b b Handling the Concurrency Application (“mutator”) Threads GC Threads • Trace • Collect • Format • Load pointer • Store pointer • Allocate

Yuasa Snapshot Algorithm (1990) • Logically • Mark-and-Sweep Style Algorithm • Take a “copy-on-write” heap snapshot • Collect the garbage in that snapshot • Physically • Stop all threads • Copy their stacks (“roots”) • Force them to save over-written pointers • Trace roots and over-written pointers * Dijktstra’75, Steele’76

U Y T W X Z a a a a a a b b b b b b 1: Take Logical Snapshot Stack p r

r U Y T X W Z a a a a a a b b b b b b 2(a): Copy Over-written Pointers Stack p

U T W X Z Y Z Y W X a a a a a a a a a a b b b b b b b b b b 2(b): Trace Stack p * Color is per-object mark bit

s W Z U X Y T V X Z Y W a a a a a a a a a a a b b b b b b b b b b b 2(c): Allocate “Black” Stack p

s W Z U X Y T V W Z Y X a a a a a a a a a a a b b b b b b b b b b b 3(a): Sweep Garbage free Stack p

t s V2 T Y X Z U Y V Z X W a a a a a a a a a a a b b b b b b b b b b b 3(b): Allocate “White” free Stack p

t s X V2 Z T U W Y V X Y W Z V a a a a a a a a a a a a a b b b b b b b b b b b b b 4: Clear Marks free Stack p

Yuasa Algorithm Phases • Snapshot stack and global roots • Trace • Flip • Sweep • Flip • Clear Marks * Synchronous

Metronome-2 Concurrency GC Master Thread GC Worker Threads Application Threads (may do GC work)

Metronome-2 Phases • Clearing • Monitor Table clearing • JNI Weak Global clearing • Debugger Reference clearing • JVMTI Table clearing • Phantom Reference clearing • Re-Trace 2 • Trace Master • (Trace*) • (Trace Terminate***) • Class Unloading • Completion • Finalizer Wakeup • Class Unloading Flush • Clearable Compaction** • Book-keeping • Initiation • Setup • turn double barrier on • Root Scan • Active Finalizer scan • Class scan • Thread scan** • switch to single barrier, color to black • Debugger, JNI, Class Loader scan • Trace • Trace* • Trace Terminate*** • Re-materialization 1 • Weak/Soft/Phantom Reference List Transfer • Weak Reference clearing** (snapshot) • Re-Trace 1 • Trace Master • (Trace*) • (Trace Terminate***) • Re-materialization 2 • Finalizable Processing • Flip • Move Available Lists to Full List* (contention) • turn write barrier off • Flush Per-thread Allocation Pages** • switch allocation color to white • switch to temp full list • Sweeping • Sweep* • Switch to regular Full List** • Move Temp Full List to regular Full List* (contention) * Parallel ** Callback *** Single actor symmetric

Part 3: Sweep • Clearing • Monitor Table clearing • JNI Weak Global clearing • Debugger Reference clearing • JVMTI Table clearing • Phantom Reference clearing • Re-Trace 2 • Trace Master • (Trace*) • (Trace Terminate***) • Class Unloading • Completion • Finalizer Wakeup • Class Unloading Flush • Clearable Compaction** • Book-keeping • Initiation • Setup • turn double barrier on • Root Scan • Active Finalizer scan • Class scan • Thread scan** • switch to single barrier, color to black • Debugger, JNI, Class Loader scan • Trace • Trace* • Trace Terminate*** • Re-materialization 1 • Weak/Soft/Phantom Reference List Transfer • Weak Reference clearing** (snapshot) • Re-Trace 1 • Trace Master • (Trace*) • (Trace Terminate***) • Re-materialization 2 • Finalizable Processing • Flip • Move Available Lists to Full List* (contention) • turn write barrier off • Flush Per-thread Allocation Pages** • switch allocation color to white • switch to temp full list • Sweeping • Sweep* • Switch to regular Full List** • Move Temp Full List to regular Full List* (contention) * Parallel ** Callback *** Single actor symmetric

s W Z U X Y T V X Z Y W a a a a a a a a a a a b b b b b b b b b b b Assume Trace Has Finished Stack p

thread list pthread pthread_t writebuf 43 color BarrierOn BarrierOn false true pthread pthread_t writebuf 41 color epoch epoch 43 37 JVM Application Thread Structure 16 64 256 sweep  dblb  Thread 1 next inVM  16 64 256 sweep  dblb  Thread 2 next inVM 

“Move Available” Phase

free 16 free 64 free 256 temp full full 256 16 64 available ? free

free 16 free 64 free 256 temp full full 256 16 64 contention? available ? free

Generic Structure of Move Phase:Shared Monotonic Work Pool GC Master Thread GC Worker Threads Application Threads (may do GC work) pool of work

Performing a GC Work Quantum every (500us) tryToDoSomeWorkForGC(); tryToDoSomeWorkForGC() { short phase = enterPhase(); if (phase > 0) doQuantum(phase); leavePhase(); }

PhaseInfo phase workers moveavail 3 Phase Entry short enterPhase() { thread->epoch = Epoch.newest; while (true) <short phase, short workers> = PhaseInfo; if (workers == phaseTable[phase].maxWorkers) return -1; if (CAS(PhaseInfo,<phase,workers>,<phase,workers+1>)) return phase; } ABA?

PhaseInfo phase workers moveavail 3 Phase Exit short leavePhase() { while (true) <short phase, short workers> = PhaseInfo; if (workers > 1 || moreWorkForPhase(phase)) newPhase = <phase,workers-1>; else if (phase == TRACE_PHASE) newPhase = <phaseTable[phase].nextPhase, 1>; else newPhase = <phaseTable[phase].nextPhase, 0>; if (CAS(PhaseInfo, <phase,workers>, newPhase)) return newPhase; }

“Per-Thread Flush” Phase

thread list pthread pthread_t PhaseInfo phase workers writebuf flush 1 43 color color BarrierOn BarrierOn false true pthread pthread_t writebuf 41 color color epoch epoch 37 43 Flush Per-Thread Pages 16 64 256 sweep  sweep  dblb  Thread 1 next inVM  16 64 256 sweep  sweep  dblb  Thread 2 next inVM 

free 16 free 64 free 256 temp full full 256 16 64 Put Full White Pages on temp-full List available ? free

   How is this Phase Different? Per-Thread State Update GC Master Thread GC Worker Threads Application Threads (may do GC work) pool of work

Callback Mechanism bool doCallbacks(int callbackId) { bool alldone = true; LOCK(threadlist); for each (Thread thread in threadlist) if (hasCallbackWork(thread, callbackId); if (! thread.inVM && TRY_LOCK(thread)) doCallbackWorkForThread(thread, callbackId); UNLOCK(thread); else postCallback(thread, callbackId); alldone = false; UNLOCK(threadlist); return alldone; } * Non-locking list with cursor?

“Sweep” Phase

free 16 free 64 free 256 temp full full 64 256 16 Sweep Phase available ? free

“Switch Full List” Phase

thread list pthread pthread_t PhaseInfo phase workers writebuf switchfull 1 43 color color BarrierOn BarrierOn false true pthread pthread_t writebuf 41 color epoch epoch 43 37 Switch Back to Normal Full List 16 64 256 color sweep  sweep  dblb  Thread 1 next inVM  16 64 256 sweep  sweep  dblb  Thread 2 next inVM 

“Drain Temp Full” Phase

free 16 free 64 free 256 temp full full 64 256 16 Drain Temp Full available ? free

Done with Collection!! • But… • That Was the Easy Part • And there are still some problems • What if threads go out to lunch?

http://www.research.ibm.com/metronomehttps://sourceforge.net/projects/tuningforkvphttp://www.research.ibm.com/metronomehttps://sourceforge.net/projects/tuningforkvp

Real-Time Garbage Collection Methods and Techniques

Real-Time Garbage Collection Methods and Techniques

Presentation Transcript

Dynamic Memory Allocation II

Parallel and Concurrent Real-time Garbage Collection Part I: Overview and Memory Allocation Subsystem

Parallel, Real-Time Garbage Collection

Parallel and Concurrent Programming

Memory Part II Memory Stages and Processes

Parallel and Concurrent Haskell Part I

Parallel and Concurrent Haskell Part II

Tax-and-Spend: Democratic Scheduling for Real-time Garbage Collection

Garbage Collection and Memory Management

Smalltalk Implementation: Memory Management and Garbage Collection

Garbage Collection and Classloading

Dynamic Memory Allocation II

Parallel Memory Allocation

Parallel Garbage Collection

A Parallel, Real-Time Garbage Collector

Dynamic Memory Allocation II

Memory allocation, garbage collection

Fast Multiprocessor Memory Allocation and Garbage Collection

Dynamic Memory Allocation II

Dynamic Memory Allocation II

Dynamic Memory Allocation II