A New Approach to Parallelising Tracing Algorithms
Cosmin E. Oancea, Alan Mycroft & Stephen M. Watt
Computer Laboratory, University of Cambridge
Computer Science Department, University of Western Ontario
I. Motivation & High Level Goal
• We study more scalable algorithms for parallel tracing:
• memory management is the primary motivation, but
• we do not claim immediate improvements to state-of-the-art GC.
• Tracing is important to computing:
• sequential tracing over a flat memory model is well understood;
• parallel tracing over multi-level memory is less clear:
• inter-processor communication cost keeps growing relative to raw instruction speed × P × ILP.
• We give a memory-centric algorithm for copy collection (a general form of tracing) that is free of locks on the mainline path.
I. Abstract Tracing Algorithm
• Assume an initialisation phase has already marked and processed some root nodes; then:
1. mark and process any unmarked child of a marked node;
2. repeat until no further marking is possible.
• Implementing the implicit fix-point via worklists yields (a sketch follows):
1. pick a node from a worklist;
2. if unmarked, then mark it, process it, and add its unmarked children to worklists;
3. repeat until all worklists are empty.
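A minimal sequential sketch of this worklist fix-point, in Java; the Node type and the process hook are illustrative names, not from the slides:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative node type; the slides leave the node representation abstract.
final class Node {
    boolean marked;
    Node[] children = new Node[0];
}

final class AbstractTracer {
    // Sequential form of the worklist fix-point; the parallel variants
    // discussed later split this single worklist into one per collector.
    static void trace(Deque<Node> worklist) {
        while (!worklist.isEmpty()) {            // 3. repeat until empty
            Node n = worklist.pop();             // 1. pick a node
            if (n.marked) continue;              //    reached via another path
            n.marked = true;                     // 2. mark it ...
            process(n);                          //    ... process it ...
            for (Node c : n.children)            //    ... and add its
                if (!c.marked) worklist.push(c); //    unmarked children
        }
    }
    static void process(Node n) { /* e.g. copy n to to-space */ }

    public static void main(String[] args) {
        Node root = new Node();
        root.children = new Node[] { new Node(), new Node() };
        Deque<Node> wl = new ArrayDeque<>();
        wl.push(root);                           // roots seed the worklist
        trace(wl);
        System.out.println(root.children[0].marked); // prints: true
    }
}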
I. Worklist Semantics: Classical
• What should worklists model?
• Classical approach: processing semantics.
[Diagram: Worklist 1, Worklist 2, Worklist 3, Worklist 4 – one per processor]
• Worklist i stores nodes to be processed by processor i!
I. Classic Algorithm

while (!worklist.isEmpty()) {
    int ind = 0;
    bool was_forwarded;
    Object from_child, to_child, to_obj = worklist.deqRand();
    foreach (from_child in to_obj.fields()) {
        ind++;
        atomic {                                    // copy exactly once
            was_forwarded = from_child.isForwarded();
            if (was_forwarded) {
                to_child = getForwardingPtr(from_child);
            } else {
                to_child = copy(from_child);
                setForwardingPtr(from_child, to_child);
            }
        }
        to_obj.setField(to_child, ind - 1);         // always fix the field
        if (!was_forwarded) worklist.enqueue(to_child);
    }
}

• Two layers of synchronisation:
• worklist level – small overhead via deques (Arora et al.) or work stealing (Michael et al.);
• the frustrating atomic block – gives an idempotent copy, and thus enables the small-overhead worklist-access solutions above (a CAS-based sketch of it follows).
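That atomic block is commonly realised with a compare-and-swap on the object's forwarding-pointer word. A hedged Java sketch of one such realisation; the AtomicReference stands in for the header word, and all names here are ours, not the paper's:

import java.util.concurrent.atomic.AtomicReference;

// Sketch only: a real collector races on raw heap words, not Java objects.
final class FromObject {
    // Stand-in for the forwarding-pointer slot in the object header.
    final AtomicReference<Object> forward = new AtomicReference<>(null);
}

final class Forwarding {
    // Returns the unique to-space copy, whichever collector wins the race.
    static Object forward(FromObject from) {
        Object seen = from.forward.get();
        if (seen != null) return seen;              // already forwarded
        Object copy = copyToToSpace(from);          // speculative copy
        if (from.forward.compareAndSet(null, copy))
            return copy;                            // we won: keep our copy
        return from.forward.get();                  // we lost: ours is waste
    }
    static Object copyToToSpace(FromObject from) { return new Object(); }

    public static void main(String[] args) {
        FromObject o = new FromObject();
        System.out.println(forward(o) == forward(o)); // true: one copy only
    }
}

The losing collector wastes a speculative copy and pays for the atomic on every field; avoiding this cost on the mainline path is exactly what the memory-centric design aims for.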
I. Related Work
• Halstead (MultiLisp) – first parallel semi-space collector, but may lead to load imbalance. Solutions:
• object stealing: Arora et al., Flood et al., Endo et al., ...
• block-based approaches: Imai and Tick, Attanasio et al., Marlow et al., ...
• Free-of-locks solutions that exploit immutable data: Doligez and Leroy, Huelsbergen and Larus.
• Memory-centric solutions – studied only in the sequential case: Shuf et al., Demers et al., Chicha and Watt.
II. Memory-Centric Tracing (High Level)
• L == memory partition (local) size; it gives the trade-off between locality of reference and load balancing.
• Worklist j stores slots: the to-space addresses of fields f of the currently copied/scanned object o that still point into from-space, where j = (o.f quo L) rem N and N is the number of worklists (see the sketch below).
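In code, the dispatch function is just integer division and remainder on the from-space address held in the slot; a small self-contained sketch, with constants and names of our choosing:

final class SlotDispatch {
    static final int L = 64 * 1024;  // local partition size (bytes)
    static final int N = 512;        // number of worklists

    // Maps the from-space address held in a slot to its worklist index:
    // j = (addr quo L) rem N, i.e. partition number modulo worklist count.
    static int worklistIndex(long fromSpaceAddr) {
        return (int) ((fromSpaceAddr / L) % N);
    }

    public static void main(String[] args) {
        // Two addresses in the same 64K partition map to the same worklist.
        System.out.println(worklistIndex(0x10000)); // partition 1
        System.out.println(worklistIndex(0x1FFFF)); // still partition 1
    }
}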
II. Memory-Centric Tracing (High Level)
[Diagram: dispatching slots to worklists or forwarding queues. Arrow semantics: double-ended – copy to to-space; dashed – insert in queue; solid – slots pointing to fields.]
1. Each worklist w is owned by at most one collector c (its owner).
2. Forwarded slots of c: those slots belonging to a partition owned by c, but discovered by another collector.
3. Eager strategy for acquiring worklist ownership: initially all roots are placed in worklists, and non-empty worklists are owned (see the dispatch sketch below).
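A sketch of this routing decision in Java; the ownership encoding and the use of ConcurrentLinkedQueue as a stand-in for the wait-free forwarding queues of the next slide are our assumptions:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Collector `me` (1..P) discovers a slot and routes it by partition ownership.
final class Dispatch {
    static final int N = 512, P = 8, L = 64 * 1024;
    final AtomicIntegerArray owner = new AtomicIntegerArray(N); // 0 == unowned
    final Deque<Long>[] worklists;      // slot worklists, one per partition
    final Queue<Long>[][] forwardTo;    // [i][j]: written by i, read by j

    @SuppressWarnings("unchecked")
    Dispatch() {
        worklists = new Deque[N];
        for (int j = 0; j < N; j++) worklists[j] = new ArrayDeque<>();
        forwardTo = new Queue[P + 1][P + 1];
        for (int i = 0; i <= P; i++)
            for (int j = 0; j <= P; j++)
                forwardTo[i][j] = new ConcurrentLinkedQueue<>();
    }

    void dispatch(int me, long slot, long fromSpaceAddr) {
        int j = (int) ((fromSpaceAddr / L) % N);      // (o.f quo L) rem N
        int o = owner.get(j);
        if (o == me) {
            worklists[j].push(slot);                  // local: no locks
        } else if (o == 0 && owner.compareAndSet(j, 0, me)) {
            worklists[j].push(slot);                  // eager acquisition
        } else {
            forwardTo[me][owner.get(j)].offer(slot);  // forwarded slot
        }
    }
}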
II. Memory-Centric Tracing: Implementation
• Each collector processes its forwarding queues (each of size F).
• Empty worklists are released (their ownership is given up).
• Each collector processes F*P*4 items from its owned worklists (the factor 4 is chosen empirically and is tied to the inverse of the forwarding ratio).
• No locking when accessing worklists or when copying.
• L (local partition size) gives the locality-of-reference level.
• Repeat until no worklists are owned && all forwarding queues are empty && all worklists are empty (see the loop sketch below).
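Assembled into one loop, a collector's iteration might be structured as below. This is a shape sketch only: the helpers are stubs of our naming, and real termination detection needs a global consensus that we elide.

// Sketch of one collector's iteration structure.
final class CollectorLoop {
    static final int F = 256, P = 8;          // queue capacity, collectors

    void run(int me) {
        while (true) {
            drainForwardingQueues(me);        // 1. accept forwarded slots
            releaseEmptyWorklists(me);        // 2. give up exhausted partitions
            int budget = F * P * 4;           // 3. bounded local work, so the
            while (budget-- > 0 && haveOwnedWork(me))
                processOneSlot(me);           //    queues are drained often
            if (noOwnedWork(me) && allQueuesEmpty() && allWorklistsEmpty())
                break;                        // 4. global termination test
        }
    }
    // Stubs standing in for the operations described on this slide.
    void drainForwardingQueues(int me) {}
    void releaseEmptyWorklists(int me) {}
    boolean haveOwnedWork(int me) { return false; }
    void processOneSlot(int me) {}
    boolean noOwnedWork(int me) { return true; }
    boolean allQueuesEmpty() { return true; }
    boolean allWorklistsEmpty() { return true; }
}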
II. Forwarding Queues on Intel IA-32
• Implement inter-processor communication:
• with P collectors, keep a P×P matrix of queues; entry (i,j) holds items enqueued by collector i and dequeued by collector j;
• a wait-free, lock-free and mfence-free IA-32 implementation:

volatile int tail = 0, head = 0, buff[F];
next : k -> (k+1) % F;

bool enq(Address slot) {            // called by collector i only
    int new_tl = next(tail);
    if (new_tl == head) return false;   // queue full
    buff[tail] = slot;                  // write the payload first,
    tail = new_tl;                      // then publish it
    return true;
}

bool is_empty() { return head == tail; }

Address deq() {                     // called by collector j, when !is_empty()
    Address slot = buff[head];
    head = next(head);
    return slot;
}
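For comparison, the same single-producer/single-consumer ring buffer transliterated to Java (our transliteration, not from the talk; Java volatile inserts fences that the raw IA-32 version gets for free from its strong store ordering):

import java.util.concurrent.atomic.AtomicReferenceArray;

// Single-producer/single-consumer ring buffer, one per (i, j) matrix entry.
// Only collector i calls enq; only collector j calls isEmpty/deq.
final class ForwardingQueue<T> {
    private final AtomicReferenceArray<T> buff;
    private final int capacity;
    private volatile int head = 0, tail = 0;

    ForwardingQueue(int capacity) {
        this.capacity = capacity;
        this.buff = new AtomicReferenceArray<>(capacity);
    }
    private int next(int k) { return (k + 1) % capacity; }

    boolean enq(T slot) {                 // producer only
        int newTl = next(tail);
        if (newTl == head) return false;  // full (one slot left unused)
        buff.set(tail, slot);             // write payload before publishing
        tail = newTl;                     // volatile store publishes it
        return true;
    }
    boolean isEmpty() { return head == tail; }

    T deq() {                             // consumer only; call when !isEmpty
        T slot = buff.get(head);
        head = next(head);
        return slot;
    }
}

Because each matrix entry has exactly one writer and one reader, neither end ever needs a compare-and-swap; fullness and emptiness checks may read stale values, which the next slide shows is benign.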
II. Forwarding Queues on Intel IA-32
• The sequentially inconsistent pattern below can occur, but the algorithm is still safe:
• the head & tail interaction reduces to a collector failing to deq from a non-empty list (and to enq into a non-full list);
• the buff[tail_prev] & head == tail_prev interaction is safe because IA-32 does not re-order writes with writes.

a = b = 0;           // Initially
// Proc 1            // Proc 2
a = 1;               b = 1;
// mfence;           // mfence;
x = b;               y = a;
// Without the fences, x == 0 && y == 0 is observable!

// The same store-then-load shape arises between the two ends:
// (two enq-s)       // (two is_empty; deq-s)
// Proc i            // Proc j
buff[tail] = ...;    head = next(head);
tail = ...;          if (head != tail)
if (new_tl == head)      ... = buff[head];
II. Dynamic Load Balancing
• Small partitions (64K) are OK under static ownership:
• grey objects get randomly distributed among the N partitions,
• yet this still gives some locality of reference (otherwise forwarding would be too expensive).
• Larger partitions may need dynamic load balancing, so partition ownership must be transferred:
• a starving collector c signals nearby collectors; these may release ownership of an owned worklist w while placing an item of w on collector c's forwarding queue (sketched below).
• Partition stealing would require locking on the mainline path, since without it the copy operation is not idempotent (Michael et al.)!
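One possible shape for that hand-over, as a hedged Java sketch; the flag-based signalling and all names are our assumptions, not the paper's mechanism:

import java.util.concurrent.atomic.AtomicIntegerArray;

// Ownership transfer for large partitions. A starving collector raises a
// flag; an owner that notices it releases one worklist and seeds the
// starving collector's forwarding queue, so that collector acquires the
// released worklist eagerly via the usual dispatch path.
final class OwnershipTransfer {
    final AtomicIntegerArray starving;  // starving[c] == 1: c wants work
    final AtomicIntegerArray owner;     // owner[w]: current owner, 0 == none

    OwnershipTransfer(int P, int N) {
        starving = new AtomicIntegerArray(P + 1);
        owner = new AtomicIntegerArray(N);
    }

    void signal(int c) { starving.set(c, 1); }   // called by starving c

    // Called by collector `me` for a worklist w it owns; the callback
    // places one item of w on c's forwarding queue.
    void maybeDonate(int me, int w, int c, Runnable forwardOneItemToC) {
        if (starving.get(c) == 1 && owner.compareAndSet(w, me, 0)) {
            forwardOneItemToC.run();
            starving.set(c, 0);
        }
    }
}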
II. Optimisation; Run-Time Adaptation
• Inter-collector producer-consumer relations are detected when forwarding queues are found full (F*P*4 processed items/iteration):
• transfer ownership to the producing collector to optimise away the forwarding.
• Run-time adaptation: monitor the forwarding ratio (FR) & the load balancing (LB):
• start with a large L; while LB is poor, decrease L;
• if FR > FR_MAX or L < L_MIN, switch to the classical algorithm (a sketch of this policy follows)!
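The adaptation policy condensed into code; thresholds, starting values, and the metric hooks are illustrative assumptions, not measured constants from the talk:

// Sketch of the run-time adaptation policy described above.
final class Adaptation {
    static final double FR_MAX = 8.0;        // max acceptable forwarding ratio
    static final int L_MIN = 16 * 1024;      // smallest useful partition size
    int L = 512 * 1024;                      // start with a large partition

    enum Mode { MEMORY_CENTRIC, CLASSICAL }
    Mode mode = Mode.MEMORY_CENTRIC;

    // Called once per collection with the observed metrics.
    void adapt(double forwardingRatio, boolean poorLoadBalance) {
        if (poorLoadBalance) L /= 2;         // trade locality for balance
        if (forwardingRatio > FR_MAX || L < L_MIN)
            mode = Mode.CLASSICAL;           // fall back to processing-centric
    }
}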
III. Empirical Results – Small Data
• Machine with two quad-core AMD Opterons, small live-data-set applications, measured against MMTk:
• time averaged over Antlr, Bloat, Pmd, Xalan, Fop, Jython, HsqldbS;
• heap size = 120-200M, average IFR = 4.2, L = 64K.
III. Empirical Results – Large Data
• Machine with two quad-core AMD Opterons, large live-data-set applications, measured against MMTk:
• time averaged over Hsqldb, GCbench, Voronoi, TreeAdd, MST, TSP, Perimeter, BH;
• heap size > 500M, average IFR = 6.3, L = 128K.
III. Empirical Results – Eclipse
• Quad-core Intel machine on Eclipse (large live data-set):
• heap size = 500M, average IFR = (only) 2.6 for L = 512K, otherwise 2.1!
III. Empirical Results – Jython
• Machine with two quad-core AMD processors on Jython:
• heap size = 200M, average IFR = (only) 3.0!
III. Conclusions
• Memory-centric algorithms may be an important alternative to processing-centric algorithms, especially on non-homogeneous hardware.
• We showed how to explicitly represent and optimise two abstractions: locality of reference (L) and inter-processor communication (FR); L trades off locality against load balancing.
• Robust behaviour: scales well with both data size and number of processors.