A Shape Analysis for Optimizing Parallel Graph Programs

A Shape Analysis for Optimizing Parallel Graph Programs Dimitrios Prountzos1 Keshav Pingali1,2 Roman Manevich2 Kathryn S. McKinley1 1: Department of Computer Science, The University of Texas at Austin 2: Institute for Computational Engineering and Sciences, The University of Texas at Austin

Motivation • Graph algorithms are ubiquitous • Goal: Compiler analysis for optimizationof parallel graph algorithms Computer Graphics Computational biology Social Networks

Organization • Parallelization of graph algorithms in Galois system • Speculative execution • Example: Boruvka MST algorithm • Optimization opportunities • Reduce speculation overheads • Analysis problem: LockSet shape analysis • Lockset shape analysis • Abstract Data Type (ADT) modeling • Hierarchy summarization abstraction • Predicate discovery • Evaluation • Fast and infers all available optimizations • Optimizations give speedup up to 12x

Boruvka’sMinimum Spanning Tree Algorithm 7 1 6 c d e f lt 2 4 4 3 1 6 a b g d e f 5 7 4 4 a,c b g 3 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node

Parallelism in Boruvka • Algorithm = repeated application of operator to graph • Active node: • Node where computation is needed • Activity: • Application of operator to active node • Neighborhood: • Sub-graph read/written to perform activity • Unordered algorithms: • Active nodes can be processed in any order • Amorphous data-parallelism • Parallel execution of activities, subject to neighborhood constraints • Neighborhoods are functions of runtime values • Parallelism cannot be uncovered at compile time in general i2 i1 i3

Optimistic Parallelization in Galois • Programming model • Client code has sequential semantics • Library of concurrent data structures • Parallel execution model • Thread-level speculation (TLS) • Activities executed speculatively • Conflict detection • Each node/edge has associated exclusive lock • Graph operations acquire locks on read/written nodes/edges • Lock owned by another thread  conflict  iteration rolled back • All locks released at the end • Two main overheads • Locking • Undo actions i2 i1 i3

Overheads(I): Locking • Optimizations • Redundant locking elimination • Lock removal for iteration private data • Lock removal for lock domination • ACQ(P): set of definitely acquiredlocks per program point P • Given method call M at P: Locks(M)  ACQ(P)  Redundant Locking

Overheads (II): Undo actions foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } foreach (Node a : wl) { … … } Lockset Grows Failsafe Lockset Stable … Program point Pis failsafe if: Q : Reaches(P,Q)  Locks(Q)  ACQ(P)

Lockset Analysis • Redundant Locking • Locks(M)  ACQ(P) • Undo elimination • Q : Reaches(P,Q) Locks(Q)  ACQ(P) • Need to compute ACQ(P) GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Runtime overhead : Runtime overhead

Analysis Challenges • The usual suspects: • Unbounded Memory  Undecidability • Aliasing, Destructive updates • Specific challenges: • Complex ADTs: unstructured graphs • Heap objects are locked • Adapt abstraction to ADTs • We use Abstract Interpretation [CC’77] • Balance precision and realistic performance

Shape Analysis Overview Predicate Discovery Graph { @rep nodes @rep edges … } Set { @rep cont … } … … HashMap-Graph Set Spec Shape Analysis Graph Spec Tree-based Set Concrete ADT Implementations in Galois library ADTSpecifications Boruvka.java Optimized Boruvka.java

ADT Specification Abstract ADT state by virtual set fields Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges Set<Node> neighbors(Node n); } ... Set<Node> S1 = g.neighbors(n); ... @locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src) @op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Boruvka.java Graph Spec Assumption: Implementation satisfies Spec

Modeling ADTs Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } c src dst src dst dst src a b Graph Spec

Modeling ADTs Abstract State Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } nodes edges c src dst ret cont src dst dst src a b Graph Spec nghbrs

Abstraction Scheme • Parameterized by set of LockPaths: L(Path)o . o ∊ Path  Locked(o) • Tracks subset of must-be-locked objects • Abstract domain elements have the form: Aliasing-configs 2LockPaths … S1 S1 S2 S2 L(S2.cont) L(S2.cont) L(S1.cont) L(S1.cont) (S1 ≠S2) ∧ L(S1.cont) ∧ L(S2.cont) cont cont cont cont 

Joining Abstract States ( L(y.nd) )  (  L(y.nd) L(x.rev(src)) )  ( () L(x.nd) ) ( L(y.nd) )  ( () L(x.nd) ) ( L(y.nd) )  ( () L(x.nd) ) Aliasing is crucial for precision May-be-locked does not enable our optimizations #Aliasing-configs: small constant (6)

Example Invariant in Boruvka GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } • The immediate neighbors • of a and lt are locked lt a ( a ≠ lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst)∧ L(a.rev(dst).src) ∧ L(lt) ∧ L(lt.rev(dst)) ∧ L(lt.rev(src)) ∧ L(lt.rev(dst).src) ∧ L(lt.rev(src).dst) …..

Heuristics for Finding Paths S • Hierarchy Summarization (HS) • x.( fld )* • Type hierarchy graph acyclic  bounded number of paths • Preflow-Push: • L(S.cont) ∧ L(S.cont.nd) • Nodes in set S and their data are locked Set<Node> cont Node nd NodeData

Footprint Graph Heuristic • Footprint Graphs (FG)[Calcagno et al. SAS’07] • All acyclic paths from arguments of ADT method to locked objects • x.( fld | rev(fld) )* • Delaunay Refinement: L(S.cont) ∧ L(S.cont.rev(src)) ∧ L(S.cont.rev(dst)) ∧ L(S.cont.rev(src).dst) ∧ L(S.cont.rev(dst).src) • Nodes in set S and all of their immediate neighbors are locked • Composition of HS, FG • Preflow-Push: L(a.rev(src).ed) FG HS

Organization • Parallelization of graph algorithms in Galois system • Speculative execution • Example: Boruvka MST algorithm • Optimization opportunities • Reduce speculation overheads • Analysis problem: LockSet shape analysis • Shape analysis • Abstract Data Type modeling • Hierarchy summarization abstraction • Predicate discovery • Evaluation • Fast and infers all available optimizations • Optimizations give speedup up to 12x

Experimental Evaluation • Implement on top of TVLA • Encode abstraction by 3-Valued Shape Analysis [SRW TOPLAS’02] • Evaluation on 4 Lonestar Java benchmarks • Inferred all available optimizations • # abstract states practically linear in program size

Impact of Optimizations for 8 Threads 8-core Intel Xeon @ 3.00 GHz

Related Work • Safe programmable speculative parallelism [Prabhu et al. PLDI’10] • Focused on value speculation on ordered algorithms • Different rollback freedom condition • Transactional Memory compiler optimizations [Harris et al. PLDI’06, Dragojevic et al. SPAA’09] • Similar optimizations • Don’t target rollback freedom • Imprecise for unbounded data-structures • Optimizations for parallel graph programs [Mendez-Lojo et al. PPOPP’10] • Manual optimizations • Failsafe subsumes cautious • Verifying conformance of ADT implementation to specification • The Jahob project (Kuncak, Rinard, Wies et al.)

Conclusion • New application for static analysis • Optimization of optimistically parallelized graph programs • Novel shape analysis • Utilize observations on the structure of concrete states and programming style • Enables optimizations crucial for performance

Thank You!

Backup

Outline of Boruvka MST Code GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Pick arbitrary worklist node Find lightest neighbor Update neighbors of lightest Update worklist and MST

Approximating Sets of Locked Objects Reachability-based Scheme S1 S2 S1 S2 cont cont cont cont Reach(S1.cont) Reach(S2.cont) Reach(S1.cont) Reach(S2.cont) Reach(S2.cont) Reach(S2.cont) Reach(S2.cont)

Approximating Sets of Locked Objects Hierarchy Summarization Scheme S1 S2 (S1 ≠S2) ∧ L(S1.cont) ∧ L(S2.cont) S1 S2 cont cont L(S2.cont) L(S1.cont) cont cont L(Path) o . o ∊ Path  Locked(o) L(S2.cont) L(S1.cont)

Enabling Optimizations in Galois • Graph API uses flags to enable/disable locking and storing undo actions • removeEdge(Node src, Node dst, Flag f); ALL LOCKS UNDO Lock(src) Lock(dst) Lock( (src,dst) ) addEdge(src, dst); NONE Challenge: Find minimal flag per ADT method call Solution: Lockset analysis

Optimization Conditions • set of definitely acquiredlocks per program point • Given method call at : • Program point is failsafe if: • Infer from by simple backward analysis

Speculation Overheads and Optimizations

Modeling ADTs Abstract State Graph<ND,ED> { @rep set<Node> ns; // nodes @rep set<Edge> es; // edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } ret ns es nghbrs a b cont 7 5 c d 3 dst src 2 4 ed a b Graph Spec 5

Failsafe Points – Eliminating Undo Actions a lt CAUTIOUS OPERATOR [Mendez et al. PPOPP’10] g.neighbors(lt); GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } 7 c d 3 2 4 a b 5 Graph Node Graph Edge Edge Data

Boruvka’sMinimum Spanning Tree Algorithm 7 1 6 c d e f 7 2 4 4 3 a b g 3 5 • Build MST bottom up • repeat { • pick arbitrary active node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a singular node

Failsafe Points – Eliminating Undo Actions a lt CAUTIOUS OPERATOR [Mendez et al. PPOPP’10] g.neighbors(lt); GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Graph Node 7 Graph Edge c d Edge Data 3 2 4 a b 5 New Lock Redundant Acquired

lt Redundant Locking Example a GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Graph Node 7 Graph Edge c d Edge Data 3 2 4 a b 5 New Lock Redundant Acquired

Boruvka’sMinimum Spanning Tree Algorithm GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } 7 1 6 c d e f 7 2 4 4 3 a b g 3 5 • Build MST iteratively • Pick random active node • Contract edge with lightest neighbor

Parallelism in Boruvka’s Algorithm GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } 7 1 6 c d e f f 3 2 4 4 a b g 5 • Dependences between activities are functions of runtime values • Parallelism cannot be uncovered at compile time in general • Don’t Care Non-Determinism • All produced MSTs correct and optimal

Abstraction of a Single State State after first loop in Boruvka ( a lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧ UniqueRef(EdgeData) 7 lt c d 3 2 4 a a b 5

Hierarchy Summarization Intuition aNghbrsltNghbrs mst nIter wl g Iterator<Node> Set<Node> Gset<Node> GBag<Weight> Graph all ns cont gcont es bcont past, at,future Edge e,an src,dst Node a, lt, n ed Weight w,wan,minW nd Void

A Shape Analysis for Optimizing Parallel Graph Programs

A Shape Analysis for Optimizing Parallel Graph Programs

Presentation Transcript

Parallel Graph Algorithms

Parallel Graph Algorithms

Chronos: A Graph Engine for Temporal Graph Analysis

Analysis of Fork-Join Parallel Programs

Parallel Graph Algorithms

Parallel Graph Algorithms

Graph A Graph B Graph C A different shape not shown here

Graph Programs for Graph Algorithms

Parallel Graph Algorithms

Parallel Programs

Parallel Graph Algorithms

Modular Shape Analysis for Dynamically Encapsulated Programs

Interprocedural shape analysis for cutpoint-free programs

Parallel Graph Algorithms

Mizan : Optimizing Graph Mining in Large Parallel Systems

Shape Analysis by Graph Decomposition

Interprocedural shape analysis for cutpoint-free programs

Interprocedural Shape Analysis for Recursive Programs

Modular Shape Analysis for Dynamically Encapsulated Programs

Parallel Graph Algorithms

Graph Programs for Graph Algorithms