1 / 79

Mining, Indexing & Searching Graphs in Large Data Sets

Mining, Indexing & Searching Graphs in Large Data Sets. Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj In collaboration with Xifeng Yan (IBM Watson), Philip S. Yu (IBM Watson), Feida Zhu (UIUC), Chen Chen (UIUC).

antranig
Download Presentation

Mining, Indexing & Searching Graphs in Large Data Sets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining, Indexing & Searching Graphs in Large Data Sets Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj In collaboration with Xifeng Yan (IBM Watson), Philip S. Yu (IBM Watson), Feida Zhu (UIUC), Chen Chen (UIUC)

  2. Research Papers Covered in this Talk • X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, ICDM'02 • X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph Patterns, KDD'03 • X. Yan, P. S. Yu, and J. Han, Graph Indexing: A Frequent Structure-based Approach, SIGMOD'04 (also in TODS’05, Google Scholar: ranked #1 out of 63,300 entries on “Graph Indexing”) • X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05 (also in TODS’06) • F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07 (Best Student Paper Award) • C. Chen, X. Yan, P. S. Yu, J. Han, D. Zhang, and X. Gu, “Towards Graph Containment Search and Indexing”, VLDB'07, Vienna, Austria, Sept. 2007

  3. Graph, Graph, Everywhere from H. Jeong et al Nature 411, 41 (2001) Aspirin Yeast protein interaction network Co-author network An Internet Web

  4. Why Graph Mining and Searching? • Graphs are ubiquitous • Chemical compounds (Cheminformatics) • Protein structures, biological pathways/networks (Bioinformactics) • Program control flow, traffic flow, and workflow analysis • XML databases, Web, and social network analysis • Graph is a general model • Trees, lattices, sequences, and items are degenerated graphs • Diversity of graphs • Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) • Complexity of algorithms: many problems are of high complexity!

  5. Outline • Mining frequent graph patterns • Constraint-based graph pattern mining • Graph indexing methods • Similairty search in graph databases • Graph containment search and indexing

  6. Graph Pattern Mining • Frequent subgraphs • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of graph pattern mining • Mining biochemical structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, comparison, and correlation analysis

  7. Example: Frequent Subgraphs Graph Dataset (A) (B) (C) Frequent Patterns (min support is 2) (1) (2)

  8. Frequent Subgraph Mining Approaches • Apriori-based approach • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) • PATH: Vanetik and Gudes (ICDM’02, ICDM’04) • FFSM: Huan, et al. (ICDM’03) • Pattern growth-based approach • MoFa, Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) • Gaston: Nijssen and Kok (KDD’04)

  9. Properties of Graph Mining Algorithms • Search order • breadth vs. depth • Generation of candidate subgraphs • apriori vs. pattern growth • Elimination of duplicate subgraphs • passive vs. active • Support calculation • embedding store or not • Discover order of patterns • path  tree  graph

  10. Apriori-Based Approach (k+1)-edge k-edge G1 G G2 G’ … G’’ Gn JOIN

  11. Pattern Growth-Based Span and Pruning 1-edge ... 2-edge ... ... If redundant, prune it! ... 3-edge G1 ... ... PRUNED ...

  12. gSpan(Yan and Han ICDM’02) Right-Most Extension Theorem: Completeness The Enumeration of Graphs using Right-most Extension is COMPLETE

  13. e0: (0,1) e1: (1,2) e2: (2,0) e3: (2,3) e4: (3,1) e5: (2,4) DFS Code • Flatten a graph into a sequence using depth first search 0 1 2 4 3

  14. Graph Pattern Explosion Problem • If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property • An n-edge frequent graph may have 2n subgraphs • Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

  15. Closed Frequent Graphs • Motivation: Handling graph pattern explosion problem • Closed frequent graph • A frequent graph G is closed if there exists no supergraph of G that carries the same support as G • If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) • Lossless compression: still ensures that the mining result is complete

  16. CLOSEGRAPH (Yan & Han, KDD’03) A Pattern-Growth Approach (k+1)-edge At what condition, can we stopsearching their children i.e., early termination? G1 k-edge G2 G If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’. … Gn

  17. Experimental Result • The AIDS antiviral screen compound dataset from NCI/NIH • The dataset contains 43,905 chemical compounds • Among these 43,905 compounds, 423 of them belongs to CA, 1081 are of CM, and the remaining are in class CI

  18. Discovered Patterns 20% 10% 5%

  19. Number of Patterns: Frequent vs. Closed CA Number of patterns minimum support

  20. Runtime: Frequent vs. Closed CA runtime (sec) minimum support

  21. Outline • Mining frequent graph patterns • Constraint-based graph pattern mining • Graph indexing methods • Similairty search in graph databases • Graph containment search and indexing

  22. Constraint-Based Graph Pattern Mining F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07 There are often various kinds of constraints specified for mining graph pattern P, e.g., max_degree(P) ≥ 10 diameter(P) ≥δ Most constraints can be pushed deeply into the mining process, thus greatly reduces search space Constraints can be classified into different categories Different categories require different pushing strategies 2007-5-23 32

  23. Pattern Pruning vs. Data Pruning • Pattern Pruning Pruning a pattern saves the mining associated with all the patterns that grow out of this pattern, which is DP • Data Pruning Data pruning considers both the pattern P and a graph G ∈ DP, and data pruning saves a portion of DP DP is the data search space of a pattern P. ST,P is the portion of DP that can be pruned by data pruning.

  24. Pruning Properties Overview Pruning property: A property of the constraint that helps prune either the pattern search space or the data search space. Pruning Pattern Search Space Strong P-antimonotonicity Weak P-antimonotoniciy Pruning Data Search Space Pattern-separable D-antimonotonicity Pattern-inseparable D-antimonotonicity 34

  25. Pruning Pattern Search Space Strong P-antimonotonicity A constraint C is strong P-antimonotone if a pattern violates C, all of its super-patterns do so too E.g., C: “The pattern is acyclic” Weak P-antimonotoniciy A constraint C is weak P-antimonotone if a graph P (with at least k vertices) satisfies C, there is at least one subgraph of P with one vertex less that satisfies C E.g., C: “The density ratio of pattern P ≥ 0.1”, i.e., A densely connected graph can always be grown from a smaller densely connected graph with one vertex less 35

  26. Pruning Data Space (I): Pattern-Separable D-Antimonotonicity • Pattern-separable D-antimonotonicity A constraint C is pattern-separable D-antimonotone if a graph G cannot make P satisfy C, then G cannot make any of P’s super-patterns satisfy C • C: “the number of edges ≥ 10”, or “the pattern contains a benzol ring”. • Use this property: recursive data reduction • A graph is pruned from the data search space for pattern P if G cannot satisfy this C

  27. M G P =M(P,G) Pruning Data Space (II): Pattern-Inseparable D-Antimonotonicity • Pattern-inseparable D-antimonotonicity • The tested pattern is not separable from the graph • E.g., “the vertex connectivity of the pattern ≥ 10” • Idea: “put pattern P back to G” • Embed the current pattern P into each G ∈ DP • Compute by a measure function M, for all supergraphs P’ such that P ⊂ P’ ⊂ G, an upper/lower bound M(P,G) of the graph property • This bound serves as a necessary condition for the existence of a constraint-satisfying supergragh P’. We discard G if this necessary condition is violated.

  28. Graph Constraints: A General Picture 38

  29. Outline • Mining frequent graph patterns • Constraint-based graph pattern mining • Graph indexing methods • Similairty search in graph databases • Graph containment search and indexing

  30. query graph graph database Graph Search: Querying Graph Databases • Querying graph databases: • Given a graph database and a query graph, find all graphs containing this query graph

  31. (a) (b) (c) Query graph Scalability Issue • Sequential scan • Disk I/O • Subgraph isomorphism testing • An indexing mechanism is needed • DayLight: Daylight.com (commercial) • GraphGrep: Dennis Shasha, et al. PODS'02 • Grace: Srinath Srinivasa, et al. ICDE'03 Sample database

  32. Indexing Strategy Query graph (Q) Graph (G) If graph G contains query graph Q, G should contain any substructure of Q Substructure Remarks • Index substructures of a query graph to prune graphs that do not contain these substructures

  33. Framework • Two steps in processing graph queries Step 1. Index Construction • Enumerate structuresin the graph database, build an inverted index between structures and graphs Step 2. Query Processing • Enumerate structuresin the query graph • Calculate the candidate graphs containing these structures • Prune the false positive answers by performing subgraph isomorphism test

  34. Cost Analysis Query Response Time Disk I/O time Isomorphism testing time Graph index access time Size of candidate answer set Remark: make |Cq| as small as possible

  35. Path-Based Approach Sample database (a) (b) (c) Paths 0-length: C, O, N, S 1-length: C-C, C-O, C-N, C-S, N-N, S-O 2-length: C-C-C, C-O-C, C-N-C, ... 3-length: ... Built an inverted index between paths and graphs

  36. Problems of Path-Based Approach Sample database (a) (b) (c) Query graph Only graph (c) contains this query graph. However, if we only index paths: C, C-C, C-C-C, C-C-C-C, we cannot prune graph (a) and (b).

  37. gIndex: Indexing Graphs by Data Mining • Our methodology on graph index: • Identify frequent structures in the database, the frequent structures are subgraphs that appear quite often in the graph database • Prune redundant frequent structures to maintain a small set of discriminative structures • Create an inverted index between discriminative frequent structures and graphs in the database

  38. IDEAS: Indexing with Two Constraints discriminative (~103) frequent (~105) structure (>106)

  39. Why Discriminative Subgraphs? Sample database • All graphs contain structures: C, C-C, C-C-C • Why bother indexing these redundant frequent structures? • Only index structures that provide more information than existing structures (a) (b) (c)

  40. Discriminative Structures • Pinpoint the most useful frequent structures • Given a set of structures f1, f2, …, fn and a new structure x, we measure the extra indexing power provided by x, P (x|f1, f2, …, fn), where fi is contained in x • When P is small enough, x is a discriminative structure and should be included in the index • Index discriminative frequent structures only • Reduce the index size by an order of magnitude

  41. Why Frequent Structures? • We cannot index (or even search) all of substructures • Large structures will likely be indexed well by their substructures • Size-increasing support threshold minimum support threshold support size

  42. Experimental Setting • The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds • Query graphs are randomly extracted from the dataset. • GraphGrep: maximum length (edges) of paths is set at 10 • gIndex: maximum size (edges) of structures is set at 10

  43. Experiments: Index Size # OF FEATURES DATABASE SIZE

  44. Experiments: Answer Set Size # OF CANDIDATES QUERY SIZE

  45. Experiments: Incremental Maintenance Frequent structures are stable to database updating Index can be built based on a small portion of a graph database, but be used for the whole database

  46. Outline • Mining frequent graph patterns • Constraint-based graph pattern mining • Graph indexing methods • Similairty search in graph databases • Graph containment search and indexing

  47. Structure Similarity Search • CHEMICAL COMPOUNDS (a) caffeine (b) diurobromine (c) viagra • QUERY GRAPH

  48. Some “Straightforward” Methods • Method1: Directly compute the similarity between the graphs in the DB and the query graph • Sequential scan • Subgraph similarity computation • Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search • Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs

  49. Index: Precise vs. Approximate Search • Precise Search • Use frequent patterns as indexing features • Select features in the database space based on their selectivity • Build the index • Approximate Search • Hard to build indices covering similar subgraphs—explosive number of subgraphs in databases • Idea: (1) keep the index structure (2) select features in the query space

More Related