GRAIL: Scalable Reachability Index for Large Graphs

GRAIL: Scalable Reachability Index for Large Graphs H. Yıldırım, V. Chaoji, and M. J. Zaki, "GRAIL: a scalable index for reachability queries in very large graphs," The VLDB Journal—The International Journal on Very Large Data Bases, vol. 21, pp. 509-534, 2012. Some slides are taken from original paper’s presentation Soheila Abrishami

INTRODUCTION • Problem Definition • Given a directed graph and two nodes , a reachability query asks if there exists a path from to in . If can reach , we denote it as • Application • Semantic Web is composed of RDF/OWL data, reachability queries can be used to infer the relationships among the objects. • Network biology, reachability querying plays a role in protein-protein interaction networks and gene networks.

DAG • The problem of reachability on directed graphs can be reduced to reachability on directed acyclic graphs (DAGs) • DAG • Each node represents a strongly connected component of original graph, each edge represents whether one component can reach another.

DAG… a: A directed graph. b: Corresponding DAG

DAG… • Answer whether node can reach in G • look up their corresponding strongly connected components, and respectively, which are the nodes in G′. If = , then and reach , If != , then we pose the question whether can reach in G′. • Reachability queries on the original graph can be answered on the DAG, and thus we will discuss methods for reachability only on DAGs.

Trade-off between Query Time and Index Size

Interval labeling • Reachability problem on trees can be solved effectively by interval labeling ; linear time and space for constructing the index, and provides constant time querying. • Definition • each node with a range = [, ] • A reachability query can be answered by comparing the corresponding intervals, i.e., can reach if and only if

Interval labeling… • By considering the min-post-labeling method for trees • : denotes the rank of the node in a post-order traversal of the tree • =min • This figure shows the min-post-labeling for an example tree. • For example, 9, since , but 7, since

Ensure that a node is not visited more than once, and a node will keep the post-order rank of its first visit. Interval containment of nodes in a DAG is not exactly equivalent to reachability. For example, .(false positive or exceptions) Generalize the interval labeling to a DAG

The GRAIL Approach • GRAIL uses min-post-labeling directly on the directed acyclic graph • GRAIL employs multiple min-post-intervals that are obtained via randomgraph traversals. symbol d denote the number of intervals to keep per node • The key idea is to do fast elimination for pairs of query nodes for whom non-reachability can be determined via the intervals.

The GRAIL Approach • Definition • In GRAIL, for a given node u, the new label is given as , where is the interval label obtained from the (random) traversal of the DAG, and , where d is dimension or number of intervals. We say that is contained in , denoted as , if and only if for all ,then we can conclude that v, as per the proposed lemma. • Two main issues in GRAIL • i) how to compute the d random interval labels while indexing • ii) how to deal with exceptions, while querying

Index Construction • The index construction step in GRAIL is very straightforward; the desired number of post-order interval labels are generated by simply changing the visitation order of the children randomly during each depth-first traversal. • The best strategy is to cease labeling after a small number of dimensions (such as 5), with reduced exceptions, rather than trying to totally eliminate all exceptions

Reachability Queries • To answer reachability queries between two nodes, u and v, GRAIL adopts a two-pronged approach. GRAIL first checks whether . If so, we can immediately conclude that , by lemma1. On the other hand, if , nothing can be concluded immediately since we know that the index can have false positives, i.e., exceptions. • Keeping explicit exception lists per node does not scale to very large graphs. Default approach in GRAIL is to use a “smart” DFS, with recursive containment check based pruning, to answer queries. • Querying takes time if and the worst case complexity is for the DFS

Experiments • Large graphs: 700K to 25M nodes with degrees 0.5 to 5 • Only GRAIL and GRIPP scale on these datasets • While GRIPP’s index size is smaller than GRAIL’s, its construction time can be up to 40 times slower than GRAIL

Experiments… • GRAIL outperforms GRIPP by orders of magnitude • GRAIL is faster than BFS-L (the best among the search-based methods) by 3–40 times on the denser graph Query times (ms)

Conclusion • GRAIL : a lightweight indexing scheme • Easy to implement and scalable • Based on interval labeling • Able to index very large and dense graphs on which existing methods fail

Thank you! Questions?

GRAIL: Scalable Reachability Index for Large Graphs