Mining Frequent Subgraphs

Mining Frequent Subgraphs COMP 790-90 Seminar Spring 2007

Overview • Introduction • Finding recurring subgraphs from graph databases. • gSpan • FFSM

Labeled Graph [Figure: example labeled graphs (S), (P), (Q)] • We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, λ), where • V is the set of vertices of G, • E ⊆ V × V is the set of undirected edges of G, • Σ_V (Σ_E) is the set of vertex (edge) labels, • λ is the labeling function, λ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.

Frequent Subgraph Mining [Figure: graph database GD with the three labeled graphs (S), (P), (Q)] • Input: a set GD of labeled undirected graphs and a support threshold σ (here σ = 2/3). • Output: all frequent subgraphs (w.r.t. σ) from GD.

Finding Frequent Subgraphs • Given a graph database GD = {G0, G1, …, Gn}, find all subgraphs appearing in at least a fraction σ of the graphs. • Isomorphic subgraphs are considered the same subgraph. • Apriori approaches: • Generation of subgraph candidates is complicated and expensive. • Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
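
To make the support definition concrete, here is a minimal brute-force sketch that counts how many graphs in a small database contain a pattern. The `(labels, edges)` representation, the helper names `make_graph`, `contains`, and `support`, and the two small graphs `G2` and `G3` are assumptions for illustration; only `P` mirrors graph (P) as encoded by Code(M1) later in the deck. The exhaustive search over vertex mappings is exponential, which is the cost that gSpan and FFSM are designed to avoid.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Labeled undirected graph: vertex -> label, frozenset({u, v}) -> edge label."""
    return dict(labels), {frozenset(e): lab for e, lab in edges.items()}

def contains(graph, pattern):
    """Brute-force subgraph isomorphism: try every injective vertex mapping."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[u] == g_labels[m[u]] for u in p_nodes)
        edges_ok = all(
            g_edges.get(frozenset(m[u] for u in e)) == lab
            for e, lab in p_edges.items()
        )
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, database):
    """Fraction of database graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)

# Toy database (G2 and G3 are hypothetical, not the deck's (S) and (Q)).
P = make_graph({0: "a", 1: "b", 2: "b", 3: "c", 4: "d"},
               {(0, 1): "y", (0, 2): "y", (1, 2): "x", (1, 3): "y", (2, 4): "y"})
G2 = make_graph({0: "a", 1: "b", 2: "b"},
                {(0, 1): "y", (0, 2): "y", (1, 2): "x"})
G3 = make_graph({0: "a", 1: "b"}, {(0, 1): "y"})
database = [P, G2, G3]

sigma = 2 / 3
triangle = G2                      # pattern a-b-b with edge labels y, y, x
b_c_edge = make_graph({0: "b", 1: "c"}, {(0, 1): "y"})
print(support(triangle, database) >= sigma)   # frequent at sigma = 2/3
print(support(b_c_edge, database) >= sigma)   # occurs only in P
```

With this toy data the triangle pattern occurs in two of the three graphs and is frequent at σ = 2/3, while the b-c edge occurs only in `P` and is not.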

gSpan • DFS without candidate generation • Relabels graph representation to support DFS. • Discovers all frequent subgraphs without candidate generation or pruning. • DFS Representation • Map each graph to a DFS code (sequence). • Lexicographically order the codes. • Construct a search tree based on the lexicographic order.

Depth-First Search Tree [Figure: panels (a)-(d)]

DFS Codes • Given e1 = (i1, j1) and e2 = (i2, j2), e1 < e2 if: • i1 = i2 and j1 < j2, or • i1 < j1 and j1 = i2. • code(G, T) is the sequence of edges ordered so that ei < ei+1. [Figure: panels (a)-(d)]

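DFS Lexicographic Order • α = code(Gα, Tα) = (a0, a1, …, am) • β = code(Gβ, Tβ) = (b0, b1, …, bn) • α ≤ β iff (1) or (2) holds: • (1) there exists t, 0 ≤ t ≤ min(m, n), such that ak = bk for all k < t and at precedes bt in the edge order • (2) ak = bk for 0 ≤ k ≤ m, and n ≥ m • Minimum DFS code • The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G. • Graphs A and B are isomorphic if and only if min(A) = min(B).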

DFS Codes: Parents and Children • If α = (a0, a1, …, am) and β = (a0, a1, …, am, b): • β is a child of α. • α is the parent of β. • A valid DFS code requires that b grows from a vertex on the rightmost path.
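
The rightmost-path condition can be illustrated with a short sketch. It assumes a DFS code is given as a list of (i, j) vertex-index pairs, with vertices numbered in discovery order so that a forward edge has i < j and a backward edge has i > j; edge and vertex labels are omitted. The function names and the sample code are hypothetical, and the check implements only the rule stated on the slide.

```python
def rightmost_path(code):
    """Vertices on the path from the root (0) to the last discovered vertex."""
    parent = {}
    for i, j in code:
        if i < j:                 # forward edge: vertex j is discovered from i
            parent[j] = i
    v = max(parent, default=0)    # rightmost vertex = last one discovered
    path = [v]
    while v in parent:            # walk back to the root along forward edges
        v = parent[v]
        path.append(v)
    return path[::-1]

def grows_from_rightmost_path(code, new_edge):
    """Slide's rule: the appended edge must start on the rightmost path."""
    return new_edge[0] in rightmost_path(code)

# Hypothetical DFS code: forward edges 0-1, 1-2, 2-3, 1-4 and backward edge 2-0.
sample = [(0, 1), (1, 2), (2, 0), (2, 3), (1, 4)]
print(rightmost_path(sample))
print(grows_from_rightmost_path(sample, (4, 5)))
print(grows_from_rightmost_path(sample, (2, 5)))
```

In the sample, vertex 4 was discovered last from vertex 1, so the rightmost path is 0, 1, 4; an edge starting at vertex 2 would not yield a valid child under the slide's rule.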

DFS Code Trees • Organize DFS codes as nodes in a parent-child tree. • Pre-order traversal follows DFS lexicographic order. • If s and s' are different DFS codes of the same graph and s' is not the minimum code, s' can be pruned.

gSpan • D is the set of all graphs. • S is the result set.
Algorithm 1: GraphSet_Projection(D, S)
    sort labels in D by frequency
    remove infrequent vertices and edges
    relabel remaining vertices and edges
    S' = all frequent 1-edge graphs in D
    sort S' in DFS lexicographic order
    S = S'
    foreach edge e in S' do
        s = graph defined by e
        s.D = subgraphs in D containing e
        Subgraph_Mining(D, S, s)
        D = D - e
        if |D| < minSup
            break
Subprocedure 1: Subgraph_Mining(D, S, s)
    if s != min(s)
        return
    S = S U {s}
    s' = +1-edge children of s in s.D
    foreach child c of s' do
        if support(c) ≥ minSup
            Subgraph_Mining(Ds, S, c)

Runtime: Synthetic [Figure: runtime (sec) plot]

Runtime: Chemical [Figure: runtime (sec), axis ticks 1 to 1000, versus support threshold (%), 0 to 30, for Apriori (FSG) and gSpan]

gSpan Advantages • Lower memory requirements. • Faster than naïve FSG by an order of magnitude. • No candidate generation. • Lexicographic ordering minimizes search tree. • False positives pruning. • Any disadvantage?

FFSM: Fast Frequent Subgraph Mining (An Overview) • How to solve the graph isomorphism problem? • A novel graph canonical form: CAM • How to tackle the subgraph isomorphism problem (NP-complete)? • Incrementally maintained embeddings • How to enumerate subgraphs? • An efficient data structure: the CAM tree • Two operations: CAM-join and CAM-extension

Adjacency Matrix [Figure: graph (P) and three of its adjacency matrices M1, M2, M3] • Every diagonal entry of an adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex. • Every off-diagonal entry in the lower triangle part of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge. (For an undirected graph, the upper triangle is always a mirror of the lower triangle.)

Code [Figure: adjacency matrices M1, M2, M3 of graph (P)] • The code of an n × n adjacency matrix M is the sequence of its lower triangular entries (including the diagonal entries) in the order M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n. • Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a • The Canonical Adjacency Matrix (CAM) is the one that produces the maximal code under lexicographic order.
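
The code and the canonical form can be sketched by brute force over vertex orderings. The sketch assumes the symbol order a > b > c > d > x > y > 0, chosen to be consistent with the two inequalities on the slide; the vertex numbering of graph (P), the dictionary representation, and the function names are hypothetical. Trying every permutation is only feasible for tiny graphs and is meant to illustrate the definition, not how FFSM finds CAMs.

```python
from itertools import permutations

ORDER = "abcdxy0"  # assumed symbol order, largest first: a > b > c > d > x > y > 0
RANK = {s: len(ORDER) - i for i, s in enumerate(ORDER)}

# Graph (P) as read off Code(M1); the vertex ids 0..4 are arbitrary.
labels = {0: "a", 1: "b", 2: "b", 3: "c", 4: "d"}
edges = {frozenset(e): lab for e, lab in {
    (0, 1): "y", (0, 2): "y", (1, 2): "x", (1, 3): "y", (2, 4): "y"}.items()}

def code(perm):
    """Row-by-row lower-triangular entries of the matrix with vertex order perm."""
    out = []
    for r, u in enumerate(perm):
        for v in perm[:r]:                       # off-diagonal entries M[r][0..r-1]
            out.append(edges.get(frozenset((u, v)), "0"))
        out.append(labels[u])                    # diagonal entry M[r][r]
    return "".join(out)

def canonical_code():
    """Maximal code over all vertex orderings, compared symbol by symbol."""
    return max((code(p) for p in permutations(labels)),
               key=lambda c: [RANK[s] for s in c])

print(code((0, 1, 2, 3, 4)))   # vertex order of M1
print(code((0, 1, 2, 4, 3)))   # vertex order of M2
print(code((2, 1, 4, 3, 0)))   # vertex order of M3
print(canonical_code())
```

Under the assumed symbol order, the maximal code is the one listed for M1, which agrees with the ordering Code(M1) > Code(M2) > Code(M3) on the slide.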

MP Submatrix [Figure: matrices M1 to M6] • For an m × m matrix A, an n × n matrix B is A's maximal proper submatrix (MP submatrix) iff B is obtained by removing the last non-zero entry from A. • We say a CAM is connected iff the corresponding graph is connected. • Theorem I: A CAM's MP submatrix is a CAM. • Theorem II: A connected CAM's MP submatrix is connected.
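
A minimal sketch of the MP submatrix operation follows. It rests on two assumptions that the slide does not spell out: "the last non-zero entry" is read as the last non-zero off-diagonal (edge) entry of the last row, and a last row left without any edge is dropped together with its vertex. A matrix is represented as a list of row strings of its lower triangle; the function names are hypothetical.

```python
def mp_submatrix(rows):
    """Remove the last edge entry; drop the last row if it has no edge left."""
    if len(rows) < 2:
        return None                      # a single vertex has no edge to remove
    last = rows[-1]
    edge_cols = [j for j, entry in enumerate(last[:-1]) if entry != "0"]
    if not edge_cols:
        return None                      # last vertex has no edge (not connected)
    if len(edge_cols) == 1:
        return rows[:-1]                 # vertex would be isolated: drop its row
    j = edge_cols[-1]
    return rows[:-1] + [last[:j] + "0" + last[j + 1:]]

def mp_chain(rows):
    """Codes obtained by repeatedly taking the MP submatrix."""
    codes = []
    while rows is not None:
        codes.append("".join(rows))
        rows = mp_submatrix(rows)
    return codes

# Lower triangle of the matrix whose code is aybyxb0y0c00y0d (M1 on the Code slide).
cam_p = ["a", "yb", "yxb", "0y0c", "00y0d"]
for c in mp_chain(cam_p):
    print(c)
```

Starting from the 5 × 5 matrix, the chain passes through six matrices down to the single vertex a, and every intermediate matrix stays connected, in line with Theorem II.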

CAM Tree: Subgraphs [Figure: graph (P) and the CAM tree of its subgraphs]

CAM Tree: Frequent Subgraphs, σ = 2/3 [Figure: graphs (S), (P), (Q) and the CAM tree of their frequent subgraphs]

How to Enumerate Nodes in a CAM Tree? • Two operations to explore the CAM tree: • CAM-join • CAM-extension • Augmenting the CAM tree with suboptimal CAMs • Objectives: • no false dismissals • no redundancy • Plus: we want to do this efficiently!

Suboptimal Tree [Figure: graph (P) and its suboptimal CAM tree, with tree edges marked j or e] • We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.

Summary • Theorem: For a graph G, let C_{K-1} (C_K) be the set of the suboptimal CAMs of all size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of C_K can be enumerated unambiguously either by joining two members of C_{K-1} or by extending a member of C_{K-1}.

Experimental Study • Predictive Toxicology Evaluation Competition (PTE) • Contains 337 compounds. • Each graph contains 27 nodes and 27 edges on average. • NIH DTP Anti-Viral Screen Test (DTP CA/CM) • Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI). • We formed a dataset containing the CA (423) and CM (1083) compounds. • Each graph contains 25 nodes and 27 edges on average.

Performance (PTE) [Figure: two plots, x-axis: support threshold (%)]

Performance (DTP CA/CM) [Figure: two plots, x-axis: support threshold (%)]
