130 likes | 237 Views
Towards Graph Containment Search and Indexing. Chen Chen 1 , Xifeng Yan 2 , Philip S. Yu 2 , Jiawei Han 1 , Dong-Qing Zhang 3 , Xiaohui Gu 2 1 University of Illinois at Urbana-Champaign 2 IBM T.J. Watson Research Center 3 Thomson Research
E N D
Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip S. Yu 2, Jiawei Han 1, Dong-Qing Zhang 3, Xiaohui Gu 2 1 University of Illinois at Urbana-Champaign 2IBM T.J. Watson Research Center 3Thomson Research VLDB ’07, September 23-28, 2007, Vienna, Austria 2007. 12. 28 Summarized by Dongjoo Lee, IDS Lab., Seoul National University Presented by Dongjoo Lee, IDS Lab., Seoul National University
Contents • Graph Search • cIndex Basic Framework • Indexing Features • Indexing Model • cIndex-Basic • Hierarchical Indexing Models • cIndex-BottomUp, cIndex-TopDown • Index Maintenances • Experiments • Discussion
Graph Search SCAN Answers of Graph Containment Search Answers of Graph Search Sample Database Sample Query q gc Subgraph q Supergraph gc q Subgraph isomorphism • Traditional Graph Search • Find all supergraphs of a query graph • Graph Containment Search • Find all subgraphs of a query graph
Subgraph Indexing & Pruning f1 q1 ga q2 f2 gb q3 f3 gc SCAN f4 Queries Sample Database Indexed Subgraphs Feature-Graph Matrix • Traditional Graph Search - Inclusion Logic Given a query graph q and a database graph g D, if a feature f q and f g, then q g . That is, if feature f is in q then the graphs not having f are pruned. • Graph Containment Search - Exclusion Logic If a feature f q and f g, then g q. That is, if feature f is not in q then the graphs having f are pruned.
Basic Search Framework Search Time Let’s reduce these • Off-line index construction • Generate and select a feature set F from the graph database D • f ∈ F, Df = {g | f ⊆ g, g ∈D} • Search • Test indexed features in F against the query q which returns all f q, and compute the candidate query answer set, Cq = D − fDf (f q, f ∈ F). • Verification • Check each graph g in the database set Cq to see whether g is really a subgraph of q
Redundancy-Aware Feature Selection X X q1 q2 X X X X X q3 f1 f4 f2 f3 • Select minimal features that cover many graphs that other features didn’t cover from frequent features • Expected number of reduced subgraph isomorphism tests • Jf = np(1-p’) -1 • n : |D| • p : frequency of f in D • p’ : probability that a query graph having f • p p’=> Maximum at p = ½ • Frequent subgraph mining algorithm • FSG, GASTON, gSpan
Maximum Coverage With Cost Definition 6. (Maximum Coverage With Cost). Given a set of subsets S = {S1, S2, …, Sm} of the universal set U = {1, 2, … , n} and a cost parameter associated with any Si ∈ S, find a subset T of S such that |∪Si∈TSi|−|T| is maximized. U = {1, 2, 3, 4, 5, 6, 7, 8, 9} S = {{4,5,6}, {1,2,4,5},{4,5,7,8}, {1,4,7}} f1 f2 f3 f4 = 3 Set Cover Problem http://en.wikipedia.org/wiki/Set_cover_problem 1 2 3 4 5 6 7 8 9 This is NP-complete problem Contrast Graph Matrix The greedy feature selection process can approximate the optimal index with K features within a ratio of 1 − 1/e. [D. S. Hochbaum, editor. “Approximation Algorithms for NP-Hard Problems”, 1997]
cIndex-Basic 1 2 3 4 5 6 7 8 9 • Time complexity : O(|F0||D||L|) ----> O(|F0||Ds||L|) • Space usage : O(|F0||D||L|) ----> O(|F0||D| + |F0||L|) Reduced by sampling Reduced by virtualization
Hierarchical Indexing Models cIndex-BottomUp cIndex-TopDown
Index Maintenance • “ostrich” strategy • Stick with the same set of selected features and the same hierarchical structure built • Sampling strategy • Periodically take small samples (Fs, Ds, Ls) and construct sample index Is • if performance of legacy index I is much worse than Is then reconstruct index • else take ostrich strategy
Experiments (1) • Chemical Descriptor Search • a model graph database usually includes a set of fundamental substructures, called descriptors. These descriptors, shared by specific groups of known molecules, often indicate particular chemical and physical properties. Given a molecule, fast searching for its “descriptor” substructures can help researchers to quickly predict its attributes. • Comparison • SCAN • FB • Extract features using gIndex and apply features as Basic Framework • cIndex-Basic, cIndex-BottomUp, cIndex-TopDown
Experiments (2) Subgraph Isomorphism Test Numbers Query Processing Time Index Maintenance Effectiveness of Data Space Reduction
Experiments (3) Performance of Hierarchical Indices Scalability of Hierarchical Indices Index Size