Efficiently Answering Reachability Queries on Large Directed Graphs

Ruoming Jin, Kent State University. Joint work with Yang Xiang (KSU), Ning Ruan (KSU), and Haixun Wang (IBM T.J. Watson)

Reachability Query The problem: Given two vertices u and v in a directed graph G, is there a path from u to v? • Query(1,11): Yes • Query(3,9): No [Figure: example DAG on vertices 1-15] Directed graph -> DAG (directed acyclic graph) by coalescing the strongly connected components
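The preprocessing mentioned on this slide (coalescing strongly connected components into a DAG) and the query itself can be sketched as follows. This is only a baseline under stated assumptions: vertices are numbered 0..n-1, the function names `condense` and `reachable` are hypothetical, Kosaraju's algorithm is one of several ways to find the components, and the query is answered by a plain graph search rather than by any index.

```python
from collections import defaultdict

def condense(n, edges):
    """Coalesce strongly connected components (Kosaraju's algorithm).

    Returns (comp, dag): comp[v] is the component id of v, and dag maps
    a component id to the set of component ids it has an edge to.
    """
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    # Pass 1: record vertices in order of DFS finishing time (iterative DFS).
    seen, order = [False] * n, []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            v, it = stack[-1]
            nxt = next((w for w in it if not seen[w]), None)
            if nxt is None:
                order.append(v)
                stack.pop()
            else:
                seen[nxt] = True
                stack.append((nxt, iter(adj[nxt])))

    # Pass 2: search the reversed graph in decreasing finishing time.
    comp, c = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if comp[w] == -1:
                    comp[w] = c
                    stack.append(w)
        c += 1

    dag = defaultdict(set)
    for u, v in edges:
        if comp[u] != comp[v]:
            dag[comp[u]].add(comp[v])
    return comp, dag

def reachable(comp, dag, u, v):
    """Baseline query: graph search over the condensed DAG."""
    target, stack, seen = comp[v], [comp[u]], {comp[u]}
    while stack:
        x = stack.pop()
        if x == target:
            return True
        for y in dag.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

# Hypothetical toy graph: a 3-cycle {0, 1, 2} feeding vertex 3, plus 4 -> 3.
comp, dag = condense(5, [(0, 1), (1, 2), (2, 0), (2, 3), (4, 3)])
print(reachable(comp, dag, 0, 3), reachable(comp, dag, 3, 0))
```

A search like this costs time proportional to the graph size per query, which is the cost the labeling schemes in the rest of the talk aim to avoid.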

Applications • XML • Biological networks • Ontology • Knowledge representation (Lattice operation) • Object programming (Class relationship) • Distributed systems (Reachable states) Graph Databases

Prior Work 2-HOP (O(n m^(1/2)) and O(n^4)), HOPI, and heuristic algorithms

Limitation of Tree-based approaches • Finding a good tree cover is expensive • A tree cover cannot represent some common types of DAGs, such as grids • Compression limitations • Chain (1 parent, 1 child) • Tree (1 parent, multiple children) • Most existing methods that use a tree cover are strongly affected by how many edges are left uncovered

Overview of Path-Tree • Chain->Tree->Path-Tree (2 parents / multiple children) • A path-tree cover is a spanning subgraph of G in a tree shape (T) • A node in the tree T corresponds to a path in G, and an edge in T corresponds to the edges between two paths in G • A 3-tuple labeling exists for any path-tree that answers a reachability query in O(1) time

Path-Tree in a Nutshell [Figure: example DAG on vertices 1-15 with its paths P1-P4] • The path-graph is not necessarily a planar graph • The reachability between any two nodes can be answered in O(1)

Key Problems • How to construct a path-tree? • Algorithm • How can a path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

Constructing Path-Tree • Step 1: Path-Decomposition of DAG • Step 2: Minimal Equivalent Edge Set between any two paths • Step 3: Path-Graph Construction • Step 4: Path-Tree Cover Extraction

Step 1: Path-Decomposition • Vertices are labeled (PID, SID), e.g. (PID, SID) = (2, 5) • For any two nodes (u, v) in the same path, u -> v if and only if u.sid ≤ v.sid • A simple linear-time algorithm based on topological sort can produce a path-decomposition [Figure: example DAG on vertices 1-15 decomposed into paths P1-P4]
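One way a topological-sort-based decomposition might look is sketched below. The greedy extension rule (follow any out-neighbour not yet on a path), the 0-based sid values, and the names `path_decomposition` and `same_path_reaches` are assumptions made for illustration; the slide does not spell out the exact procedure.

```python
from collections import deque

def path_decomposition(n, adj):
    """Greedy path-decomposition of a DAG with vertices 0..n-1.

    adj[u] lists the out-neighbours of u. Returns (paths, label) where
    label[v] = (pid, sid): path id and position of v within that path.
    """
    # Kahn's algorithm for a topological order.
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    topo = []
    while queue:
        u = queue.popleft()
        topo.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    label, paths = {}, []
    for s in topo:
        if s in label:
            continue
        pid, path, v = len(paths), [], s
        while v is not None:
            label[v] = (pid, len(path))
            path.append(v)
            # Extend along any out-neighbour that is not yet on a path.
            v = next((w for w in adj[v] if w not in label), None)
        paths.append(path)
    return paths, label

def same_path_reaches(label, u, v):
    """Within one path, u reaches v iff u.sid <= v.sid."""
    (pu, su), (pv, sv) = label[u], label[v]
    assert pu == pv, "only defined for vertices on the same path"
    return su <= sv

# Hypothetical diamond-shaped DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3.
paths, label = path_decomposition(4, [[1, 2], [3], [3], []])
print(paths)
print(same_path_reaches(label, 0, 3), same_path_reaches(label, 3, 0))
```

Each adjacency list is scanned a constant number of times, so the sketch runs in time linear in the number of vertices and edges.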

Step 2: Minimal equivalent edge set • The reachability between any two paths can be captured by a unique minimal set of edges • The edges in the minimal equivalent edge set do not cross (always parallel)! [Figure: paths P1 and P2 with the edges P1 -> P2]

Step 3: Path-Graph Construction • The weight of an edge reflects the cost we have to pay in the transitive closure computation if we exclude this path-tree edge [Figure: example DAG with paths P1-P4 and the corresponding weighted directed path-graph]

Step 4: Extracting Path-Tree Cover • Extract a maximal directed spanning tree from the weighted directed path-graph • Chu-Liu/Edmonds algorithm, O(m' + k log k) [Figure: weighted directed path-graph over P1-P4 and its maximal directed spanning tree]

Key Problems • How to construct a path-tree? • Algorithm • How can path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

3-Tuple Labeling for Reachability • Interval labeling (2-tuple): high-level description about paths (Pi -> Pj?) • DFS labeling (1-tuple) [Figure: path-tree with interval labels P1 [1,1], P2 [1,3], P3 [2,2], P4 [1,4]]

DFS labeling • Start from the first vertex in the root path • Always try to visit the next vertex in the same path • Label a node when all its neighbors have been visited • L(v) = N - x, where x is the number of nodes that have already been labeled [Figure: example over paths P1-P4 with a DFS label on each vertex]
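The four rules on this slide can be read as a post-order depth-first search, sketched below. The function name `dfs_labels`, the dictionary-based graph encoding, and the assumption that every vertex is reachable from the chosen root are all illustrative choices, not details given in the talk.

```python
def dfs_labels(adj, path_of, root):
    """Post-order DFS labels: L(v) = N - x, x = nodes labelled before v.

    adj[u] lists the out-neighbours of u in the path-tree cover and
    path_of[u] is the path id of u. Assumes every vertex is reachable
    from root (recursive, so intended for small examples only).
    """
    n = len(adj)
    labels, visited = {}, set()

    def visit(u):
        visited.add(u)
        # Try the same-path successor first (False sorts before True).
        for w in sorted(adj[u], key=lambda w: path_of[w] != path_of[u]):
            if w not in visited:
                visit(w)
        # All neighbours are done: len(labels) nodes were labelled before u.
        labels[u] = n - len(labels)

    visit(root)
    return labels

# Hypothetical example: path 0 is 0 -> 1 -> 2, path 1 is 3 -> 4,
# with cross edges 0 -> 3 and 1 -> 4.
adj = {0: [3, 1], 1: [4, 2], 2: [], 3: [4], 4: []}
path_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
print(dfs_labels(adj, path_of, root=0))
```

In this toy example every edge goes from a smaller label to a larger one, which is the property the query condition relies on.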

3-Tuple Labeling for Reachability u -> v if and only if 1) interval label I(u) ⊇ I(v) 2) DFS label L(u) ≤ L(v) • Query(9,15): P4 [1,4] ⊇ P1 [1,1] and 5 < 15, so Yes • Query(9,2)? • Query(5,9)? [Figure: path-tree with interval labels P1 [1,1], P2 [1,3], P3 [2,2], P4 [1,4] and a DFS label on each vertex]
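The constant-time check can be sketched as three integer comparisons. This reads the slide's two conditions as "the interval of u's path contains the interval of v's path" and "L(u) ≤ L(v)", which is what the Query(9,15) example suggests; the `Label` type and `reaches` function are hypothetical names, and only the labels for vertices 9 and 15 are taken from the slide.

```python
from typing import NamedTuple

class Label(NamedTuple):
    lo: int    # interval start of the vertex's path in the path-tree
    hi: int    # interval end of the vertex's path in the path-tree
    dfs: int   # DFS label L(v) of the vertex itself

def reaches(u: Label, v: Label) -> bool:
    """u -> v iff I(u) contains I(v) and L(u) <= L(v)."""
    return u.lo <= v.lo and v.hi <= u.hi and u.dfs <= v.dfs

# Labels read off the slide's Query(9,15) example.
v9 = Label(1, 4, 5)      # vertex 9 on P4, interval [1,4], DFS label 5
v15 = Label(1, 1, 15)    # vertex 15 on P1, interval [1,1], DFS label 15
print(reaches(v9, v15))  # True, matching the slide's "Yes"
print(reaches(v15, v9))  # False: [1,1] does not contain [1,4]
```

Because the test touches only the two labels, its cost does not depend on the size of the graph.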

Transitive Closure Compression • The path-tree cover (including labeling) can be constructed in O(m + n log n) • An efficient procedure can compute and compress the transitive closure in O(mk), where k is the number of paths in the path-tree [Figure: example DAG on vertices 1-15]

Key Problems • How to construct a path-tree? • Algorithm • How can a path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

Theoretical Analysis • Optimal Path-Tree Cover (OPTC) Problem: • Given a path-decomposition, what is the optimal path-tree cover to maximally compress the transitive closure? • OptIndex weight assignment based on computing the predecessor set • Optimal Path-Decomposition (OPD) Problem: • Assuming we only use path-decomposition to compress the transitive closure, what is the optimal path-decomposition to maximally compress the transitive closure? • Minimal-cost flow problem • What is the overall optimal path-decomposition?

Superiority of Path-Tree Cover • The optimal tree cover is a special case of path-tree cover when each vertex corresponds to a single path and the weight is based on OptIndex. • The path-tree cover approach can compress the transitive closure to a size no larger than that achieved by the optimal tree cover approach (and consequently by the optimal chain cover approach).

Experimental Evaluation • Implementation in C++ • 12 Real datasets used in Dual-labeling paper and GRIPP paper • Synthetic datasets • Sparse DAG with edge density = 2 • AMD Opteron 2.0GHz/ 2GB/ Linux • PTree1 (OptIndex) and PTree2 • Mainly compare with Optimal Tree Cover

Real Datasets

Experimental Result (Real Data) On average 10 times better than Tree On average 3 times better than Tree

Experimental Result (Synthetic Data)

Conclusion • A novel path-tree structure is proposed to help compress the transitive closure and answer reachability queries • Path-tree has the potential to be integrated with other existing methods to further improve the efficiency of reachability query processing

Thanks!!

Step 3: Path-Graph Construction • Weight reflects the penalty if we exclude this path-tree edge [Figure: example DAG with paths P1-P4 and the corresponding weighted directed path-graph]

Step 2: Constructing Minimal Equivalent Edge Set (Pi -> Pj) 1) Order the vertices in Pi and Pj in decreasing order 2) Find the first vertex v in Pj that Pi can reach 3) Find the last vertex u in Pi that reaches v 4) Remove all the edges that cross (u, v) 5) Repeat steps 2-4 [Figure: paths P1 and P2 with the edges between them]
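The effect of this procedure can be sketched with a single sorted sweep rather than the slide's repeat loop. The sketch assumes each edge from Pi to Pj is given as a pair of positions (su, sv) along the two paths, and it uses the observation that an edge is redundant when another edge leaves Pi no earlier and enters Pj no later. The function name `minimal_equivalent_edges` is hypothetical, and the sweep is a swapped-in formulation, not the exact steps listed above.

```python
def minimal_equivalent_edges(edges):
    """Keep only the non-redundant edges between two paths.

    edges: iterable of (su, sv), an edge from position su on Pi to
    position sv on Pj (larger position = later on the path).
    """
    kept, best = [], float("inf")
    # Latest source first; among equal sources, earliest target first.
    for su, sv in sorted(set(edges), key=lambda e: (-e[0], e[1])):
        # Keep the edge only if it enters Pj earlier than every edge
        # that leaves Pi at the same or a later position.
        if sv < best:
            kept.append((su, sv))
            best = sv
    return kept

# Hypothetical edge set between two paths.
print(minimal_equivalent_edges([(1, 3), (2, 2), (3, 4), (1, 1), (3, 5)]))
```

In the returned list both the source and the target positions decrease together, so no two kept edges cross, in line with the "always parallel" remark in the talk.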

