Graph Indexing Techniques Seoul National University IDB Lab. Kisung Kim 2011. 3. 23
Outline • Category of graph queries • Querying in collection DB • References
Categories of Graph Queries: Matching Type • Exact subgraph matching • Find graphs in the DB that contain every component of the query graph • Similarity subgraph matching • Find graphs in the DB that contain some components of the query graph • A similarity measure is needed • Supergraph matching • Find graphs in the DB that are contained in the query graph [Figure: a query graph shown with an exact subgraph match and a similarity subgraph match]
Categories of Graph Queries: Target DB • Collection DB: a large number of small graphs • e.g. chemical compounds • Retrieval result • IDs of the graphs that contain matching parts • Large graphs: a small number of large graphs • e.g. social networks, RDF graphs • Retrieval result • All matching subgraphs [Figure: querying a collection DB of graphs G1 to G7 returns a graph ID list (G1, G3, G5); querying large graphs returns the matching subgraphs]
Query Processing in Collection DB • Processing flow: query → filtering → candidate graph set → verification → answer graphs • Verification uses an ordinary pairwise subgraph isomorphism algorithm • Most techniques focus on filtering • The cost of verification is high • Filtering reduces the number of verifications that must be run
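The filtering-and-verification flow can be sketched in a few lines of Python. This is a minimal illustration, not any specific system from the slides: the graph representation, the label-count filter, and the brute-force verifier are all assumptions chosen for brevity, and the three database graphs are made up.

```python
from collections import Counter
from itertools import permutations

def make_graph(labels, edges):
    """labels: {node: label}; edges: undirected (u, v) pairs. Hypothetical format."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}

def label_filter(q, g):
    """Cheap necessary condition: g has at least as many nodes of each label as q."""
    need = Counter(q["labels"].values())
    have = Counter(g["labels"].values())
    return all(have[lab] >= n for lab, n in need.items())

def is_subgraph(q, g):
    """Expensive exact test: try every injective node mapping from q into g."""
    q_nodes = list(q["labels"])
    for image in permutations(g["labels"], len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q["labels"][u] != g["labels"][m[u]] for u in q_nodes):
            continue  # labels do not match under this mapping
        if all(frozenset(m[u] for u in e) in g["edges"] for e in q["edges"]):
            return True  # every query edge lands on an edge of g
    return False

def filter_and_verify(q, db):
    candidates = [gid for gid, g in db.items() if label_filter(q, g)]
    answers = [gid for gid in candidates if is_subgraph(q, db[gid])]
    return candidates, answers

# Made-up example: the query is the path B - A - C.
query = make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (1, 3)])
db = {
    "g1": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (1, 3)]),
    "g2": make_graph({1: "A", 2: "B"}, [(1, 2)]),
    "g3": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
}
candidates, answers = filter_and_verify(query, db)
```

Here the filter discards g2 without running the isomorphism test, while g3 survives the filter and is rejected only by verification. A stronger filter would have removed g3 as well, which is the motivation for the index structures that follow.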
Query Processing in Large Graphs • Processing flow: query → index search → candidate node sets → building subgraphs → answer subgraphs • Focus on node indexing • Reduces the search space • Uses structural information of nodes • Build subgraphs by joining candidate nodes • Join methods have received relatively little research • Optimization using join ordering
GraphGrep (1/2) [Shasha et al., PODS’02] • First work to adopt the filtering-and-verification framework • Path-based index • Fingerprint of the database • Enumerate all paths (length <= L) of all graphs in the DB • For each path, the number of occurrences in each graph is stored in a hash table [Figure: three example graphs g1, g2, g3 with node labels A to E, and the resulting path index]
GraphGrep (2/2): Query Processing • Filtering • Build the fingerprint of query q • Hash all paths (length <= L) of q • Compare the fingerprint of the query with the fingerprint of the database • Discard any graph whose count for some path is less than the count in the query fingerprint • Verification • Run subgraph isomorphism tests on the remaining candidates [Figure: a query with nodes A, B, C has fingerprint AB:1, AC:1, BAC:1; comparing it with the index of g1, g2, g3 leaves candidates = {g1, g3} for verification]
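A path fingerprint and the count-based filter can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a plain Counter keyed by the label string instead of a fixed-size hash table, it counts each undirected path once per direction (for the query and the database alike, so the comparison stays consistent), and the graphs t1 to t3 are made up rather than taken from the slide figure.

```python
from collections import Counter

def adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def path_fingerprint(labels, edges, max_len):
    """Count the label strings of all simple paths with 1..max_len edges."""
    adj = adjacency(edges)
    counts = Counter()

    def extend(path):
        if len(path) > 1:
            counts["".join(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:
            return
        for w in adj.get(path[-1], []):
            if w not in path:  # simple paths only
                extend(path + [w])

    for v in labels:
        extend([v])
    return counts

def passes_filter(query_fp, graph_fp):
    """Keep a graph only if it has at least as many occurrences of every query path."""
    return all(graph_fp[p] >= c for p, c in query_fp.items())

def candidates(query, db, max_len):
    q_fp = path_fingerprint(*query, max_len)
    return [gid for gid, g in db.items()
            if passes_filter(q_fp, path_fingerprint(*g, max_len))]

# Made-up data: the query is the path B - A - C.
query = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)])
db = {
    "t1": ({0: "A", 1: "B", 2: "C", 3: "B"}, [(0, 1), (0, 2), (1, 2), (2, 3)]),
    "t2": ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),
    "t3": ({0: "A", 1: "B", 2: "C", 3: "A"}, [(0, 1), (2, 3)]),
}
short_paths = candidates(query, db, max_len=1)
longer_paths = candidates(query, db, max_len=2)
```

With max_len=1, t3 still passes because it contains an A-B edge and an A-C edge, only on different A nodes. With max_len=2 the path BAC is required and t3 is discarded, which shows how longer paths sharpen the filter at the cost of a larger index.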
gIndex (1/6) [Yan et al., SIGMOD’04] • The path-based approach has weak points • A path is too simple: structural information is lost • There are too many paths: the set of paths in a graph database is usually huge • Solution • Use graph structures instead of paths as the basic index feature [Figure: a sample database of graphs whose nodes are all labeled c, a query graph, and the paths in the query graph; the paths cannot filter out any graph in the database]
gIndex (2/6): Frequent Fragment • The number of graph structures is large • Index only frequent subgraphs • support(g) • The number of graphs in D (the graph database) that contain g as a subgraph • minSup • Minimum support threshold • Index a fragment g only if support(g) ≥ minSup • Size-increasing support • The number of frequent fragments grows as the fragment size increases • Low minSup for small fragments, high minSup for large fragments
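The selection rule with a size-increasing threshold can be sketched as below. The linear ramp in min_sup, its parameters max_size and theta, and the fragment table are all hypothetical; the slides only state that minSup should be low for small fragments and high for large ones. Each fragment is represented by its size and the precomputed set of graphs containing it, so no subgraph test is needed here.

```python
def min_sup(size, max_size=4, theta=3):
    """Hypothetical size-increasing threshold: 1 at size 1, rising to theta at max_size."""
    return 1 + (theta - 1) * (size - 1) // (max_size - 1)

def select_frequent(fragments):
    """fragments: {name: (size, set of graph IDs containing the fragment)}."""
    return [name for name, (size, ids) in fragments.items()
            if len(ids) >= min_sup(size)]

# Made-up fragment table.
fragments = {
    "fa": (1, {"g1", "g2", "g3"}),  # support 3, threshold 1 -> indexed
    "fb": (2, {"g1"}),              # support 1, threshold 1 -> indexed
    "fc": (3, {"g1"}),              # support 1, threshold 2 -> skipped
    "fd": (3, {"g1", "g2"}),        # support 2, threshold 2 -> indexed
    "fe": (4, {"g1", "g2"}),        # support 2, threshold 3 -> skipped
}
thresholds = [min_sup(s) for s in (1, 2, 3, 4)]
indexed = select_frequent(fragments)
```

Small fragments are kept even when rare, so every query still has some indexed features, while the rising threshold keeps the much larger population of big fragments under control.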
gIndex (3/6): Frequent Fragment [Figure: example fragments of size 1 to 4 over node labels A and B, each annotated with its frequency F (values from 1 to 4); the threshold rises with size, minSup=1 for the two smaller sizes and minSup=2 for the two larger sizes]
gIndex (4/6): Discriminative Fragment • Redundant fragment • A fragment whose set of containing graphs is already determined by its indexed subgraphs • Redundant fragments need not be indexed • Discriminative fragment • A fragment that is not redundant [Figure: graphs g1 to g4 and fragments f1, f2, f3 with Df1 = {g1, g2, g3}, Df2 = {g2, g3, g4}, Df3 = {g2, g3} = Df1 ∩ Df2, so f3 is redundant]
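The redundancy test from the figure, Df3 = Df1 ∩ Df2, can be written directly with set operations. This sketch uses the strict equality shown on the slide; the fragment f4 and its ID list are made up to show the discriminative case.

```python
def is_redundant(graph_ids, subfragment_id_lists):
    """A fragment is redundant when the intersection of its indexed subfragments'
    ID lists already equals its own ID list."""
    if not subfragment_id_lists:
        return False  # nothing smaller is indexed, so the fragment must be kept
    implied = set.intersection(*map(set, subfragment_id_lists))
    return implied == set(graph_ids)

# ID lists from the slide figure.
D_f1 = {"g1", "g2", "g3"}
D_f2 = {"g2", "g3", "g4"}
D_f3 = {"g2", "g3"}
# Hypothetical fragment that also contains f1 and f2 but occurs only in g2.
D_f4 = {"g2"}

f3_redundant = is_redundant(D_f3, [D_f1, D_f2])
f4_redundant = is_redundant(D_f4, [D_f1, D_f2])
```

Indexing f3 would add nothing, because intersecting the lists of f1 and f2 already yields {g2, g3}. Indexing f4 removes g3 from the candidate set, so f4 is discriminative.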
gIndex (5/6): gIndex Tree • Use a graph serialization method • For fast graph isomorphism checking during index search • DFS coding [Yan et al., ICDM’02] • Translates a graph into a unique edge sequence • gIndex Tree • Prefix tree built from the edge sequences of discriminative fragments • All size-n discriminative fragments are recorded at level n • Black nodes: discriminative fragments • Have ID lists: the IDs of the graphs containing fi • White nodes: redundant fragments, kept for Apriori pruning [Figure: DFS coding of an example graph with nodes v0 to v3 into the edge sequence <(v0,v1),(v1,v2),(v2,v0),(v1,v3)>, and a gIndex tree with levels 0 to 2 holding fragments f1, f2, f3 along edges e1, e2, e3]
gIndex (6/6): Searching • Searching process • Given a query q, enumerate all of q’s fragments (size <= maxSize) • Locate the fragments in the gIndex tree • Intersect the ID lists associated with the fragments • Apriori pruning • Generating every fragment is inefficient • If a fragment is not in the gIndex tree, its supergraphs need not be checked • Redundant fragments must be recorded so that Apriori pruning works [Figure: for the query <e1, e2, e3, e4, e5>, the fragments <e1>, <e1, e2>, <e1, e2, e3> are located in the gIndex tree; the search stops at <e1, e2, e3, e4>, which is not in the tree, and continues with <e2>, …]
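The search with Apriori pruning can be sketched on a small prefix tree. This is a simplification under stated assumptions: fragments are represented as tuples of edge names standing in for DFS codes, query fragments are enumerated as contiguous runs of the query sequence (real gIndex enumerates connected subgraphs), and the tree contents and ID lists are made up.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.ids = None  # None marks a redundant (white) node without an ID list

def insert(root, code, ids=None):
    """Add a fragment code; pass ids for discriminative (black) nodes only."""
    node = root
    for e in code:
        node = node.children.setdefault(e, Node())
    node.ids = None if ids is None else set(ids)

def search(root, query_code, all_ids, max_size):
    """Intersect the ID lists of all indexed query fragments, pruning on a miss."""
    candidates = set(all_ids)
    located = []
    for start in range(len(query_code)):
        node = root
        for end in range(start, min(start + max_size, len(query_code))):
            node = node.children.get(query_code[end])
            if node is None:
                break  # Apriori pruning: no extension of this fragment is indexed
            located.append(tuple(query_code[start:end + 1]))
            if node.ids is not None:
                candidates &= node.ids
    return candidates, located

root = Node()
insert(root, ("e1",), {"g1", "g2", "g3", "g4"})
insert(root, ("e1", "e2"))                      # redundant, kept for pruning
insert(root, ("e1", "e2", "e3"), {"g1", "g2"})
insert(root, ("e2",), {"g1", "g2", "g3"})

result, located = search(root, ["e1", "e2", "e3", "e4", "e5"],
                         {"g1", "g2", "g3", "g4"}, max_size=4)
```

The white node for <e1, e2> carries no ID list, but without it the walk could not reach <e1, e2, e3>. The search gives up on the first branch as soon as <e1, e2, e3, e4> is missing, mirroring the "stop" in the slide figure.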
Grafil (1/4) [Yan et al., SIGMOD’05] • Subgraph similarity search • Feature-based approach • Similarity search using relaxed queries • Relax a query by deleting k edges • Missed edges incur missed features • Main question • What is the maximum number of missed features (mmax) when relaxing a query with k missed edges? [Figure: subgraph exact search versus subgraph similarity search over graphs G1, G2, …, Gn, using feature vectors {u1, u2, …, un} and {v1, v2, …, vn}]
Grafil (2/4): Feature Misses [Figure: a query with features fa, fb, fc and its relaxed queries with 1 missing edge; the feature misses of the relaxed queries are 7-4=3, 7-3=4, and 7-3=4, so the maximum number of feature misses is mmax=4]
Grafil (3/4): Feature Miss Estimation • Problem • Given a query Q and a set of features contained in Q, if the relaxation ratio is given, what is the maximal number of features that can be missed? • Use an edge-feature matrix • Find the maximum number of columns that can be hit by k rows • k: the number of missing edges in Q • Classic maximum coverage problem (set k-cover) • Proven NP-complete [Figure: a query with edges e1, e2, e3, its features fa, fb, fc, and the corresponding edge-feature matrix]
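For a small query, the "columns hit by k rows" question can be answered exactly by trying every choice of k edges. The sketch below does that by brute force, which is exponential in k and is only meant to make the problem statement concrete; it is not the estimation method of the paper. The matrix is stored as a hypothetical mapping from each edge to the set of feature occurrences that use it, with made-up contents.

```python
from itertools import combinations

def max_feature_misses(edge_features, k):
    """Largest number of feature occurrences destroyed by deleting any k edges."""
    best = 0
    for rows in combinations(edge_features.values(), k):
        hit = set().union(*rows)  # columns covered by the chosen rows
        best = max(best, len(hit))
    return best

# Made-up edge-feature matrix: fa1, fb1, ... are individual feature occurrences.
edge_features = {
    "e1": {"fa1", "fb1"},
    "e2": {"fa1", "fb1", "fb2", "fc1"},
    "e3": {"fb2", "fc1", "fc2"},
}
misses = [max_feature_misses(edge_features, k) for k in (0, 1, 2, 3)]
```

Deleting e2 alone already removes four occurrences, and any pair containing e3 removes all five. Since the general problem is NP-complete, enumeration like this does not scale beyond small queries.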
Grafil (4/4): Feature Conjugation • Misses of one feature can be compensated by occurrences of other features in G • Using all features together in one filter degrades filtering performance • Solution • Use multiple filters • Feature set selection [Figure: a query with features fa and fb and a graph G over node labels A, B, C; with relaxation ratio 1 and mmax=4, the feature miss count is (3-0)+0=3 ≤ mmax, so G is not filtered out]
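The miss count compared against mmax can be sketched as a sum of per-feature shortfalls. In this illustration the function names and the example vectors are assumptions; the first graph loosely mirrors the (3-0)+0=3 computation in the figure. A feature contributes to the miss count only when the graph has fewer occurrences than the query.

```python
def feature_misses(query_counts, graph_counts):
    """Sum, over the query's features, of how many occurrences the graph lacks."""
    return sum(max(0, n - graph_counts.get(f, 0))
               for f, n in query_counts.items())

def survives_filter(query_counts, graph_counts, m_max):
    """A graph stays a candidate only if its misses do not exceed m_max."""
    return feature_misses(query_counts, graph_counts) <= m_max

query_counts = {"fa": 3, "fb": 4}  # hypothetical query feature vector
graph_a = {"fa": 0, "fb": 5}       # misses (3-0) + 0 = 3
graph_b = {"fa": 1, "fb": 1}       # misses (3-1) + (4-1) = 5

kept_a = survives_filter(query_counts, graph_a, m_max=4)
kept_b = survives_filter(query_counts, graph_b, m_max=4)
```

Because the bound is applied to the summed misses, a filter over many features needs a large m_max, and individual shortfalls get lost in the total. Grouping features into several filters, each with its own bound, is the remedy the slide proposes.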
References • [Shasha et al., PODS’02] Dennis Shasha, Jason T. L. Wang, Rosalba Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002. • [Yan et al., SIGMOD’04] Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD, 2004. • [Yan et al., SIGMOD’05] Xifeng Yan, Philip S. Yu, Jiawei Han. Substructure Similarity Search in Graph Databases. SIGMOD, 2005. • [Tian and Patel, ICDE’08] Yuanyuan Tian, Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008. • [He and Singh, SIGMOD’08] Huahai He, Ambuj K. Singh. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. SIGMOD, 2008. • [Zhao and Han, VLDB’10] Peixiang Zhao, Jiawei Han. On Graph Query Optimization in Large Networks. VLDB, 2010. • [He and Singh, ICDE’06] Huahai He, Ambuj K. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006. • [Shang et al., VLDB’08] Haichuan Shang, Ying Zhang, Xuemin Lin, Jeffrey Xu Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. VLDB, 2008.