EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar: Mining Biological Data • Instructor: Luke Huan • Fall, 2006

Graph Data Analysis Overview • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Graph Classification • Graph Clustering • Summary

Why Graph Mining? • Graphs are ubiquitous • Chemical compounds (Cheminformatics) • Protein structures, biological pathways/networks (Bioinformatics) • Program control flow, traffic flow, and workflow analysis • XML databases, Web, and social network analysis • Graph is a general model • Trees, lattices, sequences, and items are degenerate graphs • Diversity of graphs • Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) • Complexity of algorithms: many problems are of high complexity

Graphs Everywhere • [Figure panels: aspirin, yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)), Internet, co-author network]

Labeled Graphs • A labeled graph is a graph where each node and each edge has a label. • [Figure: three labeled graphs G1 (nodes p1 to p5), G2 (nodes q1 to q3), and G3 (nodes s1 to s4), with node labels a, b, c, d and edge labels x, y]

Pattern Matching • A graph G is subgraph isomorphic to a graph G’, denoted by G ⊆ G’, if there exists a 1-1 mapping from the nodes of G to the nodes of G’ such that node labels, edges, and edge labels are preserved by the mapping. • A pattern is a graph. A pattern G matches G’ if G ⊆ G’. • G occurs in G’ if G ⊆ G’. • Given a label set, a graph space is the collection of graphs whose labels are drawn from that set. • [Figure: labeled graphs G1 (nodes p1 to p5), G2 (nodes q1 to q3), G3 (nodes s1 to s4), and a pattern G (nodes g1 to g3)]

Graph Pattern Mining • Frequent subgraphs • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
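The definitions of pattern matching and support can be sketched in a few lines of Python. This is a brute-force illustration only, not one of the mining algorithms discussed later: the graph representation (a dict of node labels plus a list of labeled undirected edges) and the function names `occurs_in`, `support`, and `is_frequent` are assumptions made for this example, and the toy database is invented.

```python
from itertools import permutations

def occurs_in(pattern, graph):
    """True if `pattern` is subgraph isomorphic to `graph` (labels preserved)."""
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    # Undirected edges of the host graph, keyed by the unordered node pair.
    g_lookup = {frozenset((u, v)): label for u, v, label in g_edges}
    p_ids = list(p_nodes)
    # Try every 1-1 assignment of pattern nodes to graph nodes (exponential).
    for image in permutations(list(g_nodes), len(p_ids)):
        m = dict(zip(p_ids, image))
        labels_ok = all(p_nodes[u] == g_nodes[m[u]] for u in p_ids)
        edges_ok = all(g_lookup.get(frozenset((m[u], m[v]))) == label
                       for u, v, label in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, database):
    """Fraction of database graphs in which the pattern occurs."""
    return sum(occurs_in(pattern, g) for g in database) / len(database)

def is_frequent(pattern, database, min_support):
    """Frequent means support is no less than the minimum support threshold."""
    return support(pattern, database) >= min_support

# Hypothetical toy database: (node labels, [(u, v, edge label), ...]).
g1 = ({1: "a", 2: "b", 3: "b"}, [(1, 2, "x"), (1, 3, "x"), (2, 3, "y")])
g2 = ({1: "a", 2: "b"}, [(1, 2, "x")])
g3 = ({1: "b", 2: "b"}, [(1, 2, "y")])
database = [g1, g2, g3]
p = ({"u": "a", "v": "b"}, [("u", "v", "x")])  # an a-node joined to a b-node by x

print(support(p, database), is_frequent(p, database, 2 / 3))
```

The exhaustive search over node assignments is what makes support counting expensive; the algorithms on the following slides differ largely in how they avoid repeating it.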

Examples • Minimum support σ = 2/3 • Induced subgraph isomorphism penalizes any unmatched edges: every edge of the host graph between matched nodes must also appear in the pattern • [Figure: graphs G1, G2, G3 and candidate patterns P1 to P6, each annotated with its frequency f; patterns marked + are induced frequent subgraphs]

Examples • Minimum support σ = 2/3 • Maximal frequent subgraphs are frequent subgraphs none of whose supergraphs are frequent • Other criteria for selecting subgraphs may be incorporated • [Figure: graphs G1, G2, G3 and candidate patterns P1 to P6, each annotated with its frequency f; patterns marked ! are maximal frequent subgraphs]

Example: Frequent Subgraphs • [Figure: a graph dataset of three graphs (A), (B), (C) and two frequent patterns (1), (2), with minimum support 2]

Applications • Mining biomolecular structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

Graph Mining Algorithms • Incomplete beam search – Greedy (Subdue) • Inductive logic programming (WARMR) • Graph theory-based approaches • Edge based • Path based • Tree based

SUBDUE (Holder et al. KDD’94) • Start with single vertices • Expand best substructures with a new edge • Limit the number of best substructures • Substructures are evaluated based on their ability to compress input graphs • Using minimum description length (DL) • Best substructure S in graph G minimizes: DL(S) + DL(G\S) • Terminate when no new substructure is discovered

WARMR (Dehaspe et al. KDD’98) • Graphs are represented by Datalog facts • atomel(C, A1, c), bond(C, A1, A2, BT), atomel(C, A2, c): a carbon atom bonded to a carbon atom with bond type BT • WARMR: the first general purpose ILP system • Level-wise search • Simulate Apriori for frequent pattern discovery

Frequent Subgraph Mining Approaches • Edge-based approach • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) • MoFa: Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) • FFSM: Huan, et al. (ICDM’03) • Path-based approach • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • Tree-based approach • Gaston: Nijssen and Kok (KDD’04) • SPIN: Huan, et al. (KDD’04)

Properties of Graph Mining Algorithms • Search order • breadth vs. depth • Generation of candidate subgraphs • apriori vs. pattern growth • Elimination of duplicate subgraphs • passive vs. active • Support calculation • embedding store or not • Discovery order of patterns • path → tree → graph

Search DAG • Task: identify all frequently occurring subgraphs from a group of graphs, or a graph database • Support anti-monotonicity • Any supergraph of an infrequent subgraph is infrequent • Known as the Apriori property • Level-wise search • Keep all patterns with the same size in memory (poor memory utilization) • Depth-first search • Better memory utilization • May repeatedly search patterns in the DAG (redundant candidates)

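Apriori-Based Approach • [Figure: k-edge graphs are combined by a JOIN step into (k+1)-edge candidates; graphs in the figure are labeled G, G’, G’’ and G1, G2, …, Gn]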

Apriori-Based, Breadth-First Search • Methodology: breadth-first search, joining two graphs • AGM (Inokuchi, et al. PKDD’00) • generates new graphs with one more node • FSG (Kuramochi and Karypis ICDM’01) • generates new graphs with one more edge

FSG Algorithm • K = 1 • F_1 = all frequent edges • Repeat • K = K + 1 • C_K = join(F_{K-1}) • F_K = frequent patterns in C_K • Until F_K is empty
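The level-wise loop above can be sketched as a small generic driver. This is not FSG itself: `levelwise_mine`, `join`, and `support` are hypothetical names, and to keep the example runnable without a graph join, the usage swaps in itemsets (which the earlier slide calls degenerate graphs) for graph patterns. The transactions are invented.

```python
def levelwise_mine(frequent_1, join, support, min_support):
    """Level-wise search: F_1 -> C_2 -> F_2 -> ... until F_K is empty."""
    levels = [set(frequent_1)]
    while levels[-1]:
        candidates = join(levels[-1])                      # C_K = join(F_{K-1})
        levels.append({c for c in candidates
                       if support(c) >= min_support})      # F_K
    return [p for level in levels for p in level]

# Stand-in problem: itemsets instead of graphs (an assumption for illustration).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

def support(itemset):
    # Absolute support: number of transactions containing the itemset.
    return sum(itemset <= t for t in transactions)

def join(level):
    # Two size-k patterns are joinable if their union has exactly k + 1 items.
    return {p | q for p in level for q in level if len(p | q) == len(p) + 1}

frequent_1 = {frozenset([i]) for i in "abc" if support(frozenset([i])) >= 2}
result = levelwise_mine(frequent_1, join, support, min_support=2)
print(sorted("".join(sorted(p)) for p in result))
```

For graphs, `join` and `support` are where the real difficulty lies, since both depend on (sub)graph isomorphism.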

Join: Key Operation • Join(L) = ∪ join(P, Q) for all P, Q ∈ L • Join(P, Q) = {G | P, Q ⊆ G, |G| = |P| + 1, |P| = |Q|} • Two graphs P and Q are joinable if the join of the two graphs produces a non-empty set • Theorem: two graphs P and Q are joinable if P ∩ Q is a graph of size |P| - 1, i.e., they share a common “core” of size |P| - 1

Multiplicity of Candidates • Case 1: identical vertex labels • [Figure: joining two graphs with vertex labels a, b, e]

Multiplicity of Candidates • Case 2: the core contains identical labels • Core: the (k-1)-subgraph that is common to the joined graphs • [Figure: joining two graphs with vertex labels a, b, c]

Multiplicity of Candidates • Case 3: core multiplicity • [Figure: joining two graphs with vertex labels a, b]

PATH (Vanetik and Gudes ICDM’02, ’04) • Apriori-based approach • Building blocks: edge-disjoint paths • Identify all frequent paths • Construct frequent graphs with 2 edge-disjoint paths • Construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths • Repeat • [Figure: a graph with 3 edge-disjoint paths]

PATH Algorithm • K = 1 • F_1 = all frequent paths • Repeat • K = K + 1 • C_K = join(F_{K-1}) • F_K = frequent patterns in C_K • Until F_K is empty

Challenges • Graph isomorphism • Two graphs may have the same topology though their layouts are different • Subgraph isomorphism • How to compute the support value of a pattern

Graph Isomorphism • Two graphs are isomorphic if they are topologically equivalent, i.e., there is a 1-1 mapping between their nodes that preserves edges and labels

Why Redundant Candidates? • All the algorithms may propose the same candidate several times. • We need to keep track of the identical candidates to • Avoid redundancy in results • Avoid redundant search

Intuitions for Graph Normalization • [Figure: a 1-1 mapping from a graph space to an arbitrary set, with a partial order defined on that set]

gSpan (Yan & Han, ICDM’02) • A graph normalization is a 1-1 mapping of a graph space to an arbitrary space (usually a string space) • Deal with graph isomorphism using DFS code • Start with a single edge • Depth-first enumeration of a pattern space • Add one edge at a time

DFS Code • Flatten a graph into a sequence using depth-first search • Edge order: e0: (0,1), e1: (1,2), e2: (2,0), e3: (2,3), e4: (3,1), e5: (1,4) • DFS code: (0, 1, x, a, y), (1, 2, y, b, x), (2, 0, x, a, x), (2, 3, x, c, z), (3, 1, z, b, y), (1, 4, y, d, z) • [Figure: a graph whose vertices are numbered 0 to 4 in DFS discovery order]
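As a rough sketch of how such a code is produced, the function below emits the DFS code of one given traversal: when a vertex is discovered, its backward edges to earlier vertices come first, followed by forward edges to new vertices. The representation (`node_labels`, `adjacency`, `edge_labels`) and the name `dfs_code` are assumptions for this example; the graph is read off the DFS code on this slide. The sketch does not search for the minimum DFS code, which would require considering every possible traversal.

```python
def dfs_code(node_labels, adjacency, edge_labels, start):
    """Return the DFS code of one particular depth-first traversal.

    adjacency[v] lists v's neighbours in the order the traversal tries them;
    edge_labels maps frozenset({u, v}) to the label of edge u-v.
    """
    index = {}  # original node -> DFS discovery index
    code = []

    def visit(v, parent):
        index[v] = len(index)
        # Backward edges first: from v to discovered nodes other than its parent.
        back = [u for u in adjacency[v] if u in index and u != parent and u != v]
        for u in sorted(back, key=index.get):
            code.append((index[v], index[u], node_labels[v],
                         edge_labels[frozenset((v, u))], node_labels[u]))
        # Then forward edges to undiscovered neighbours, recursing depth-first.
        for u in adjacency[v]:
            if u not in index:
                code.append((index[v], len(index), node_labels[v],
                             edge_labels[frozenset((v, u))], node_labels[u]))
                visit(u, v)

    visit(start, None)
    return code

# The graph from the slide: vertex labels x, y, z and edge labels a, b, c, d.
node_labels = {0: "x", 1: "y", 2: "x", 3: "z", 4: "z"}
adjacency = {0: [1, 2], 1: [0, 2, 3, 4], 2: [0, 1, 3], 3: [1, 2], 4: [1]}
edge_labels = {frozenset(e): lab for e, lab in
               [((0, 1), "a"), ((1, 2), "b"), ((2, 0), "a"),
                ((2, 3), "c"), ((3, 1), "b"), ((1, 4), "d")]}

slide_code = dfs_code(node_labels, adjacency, edge_labels, start=0)
print(slide_code)
```

Changing the start vertex or the neighbour order yields a different code for the same graph, which is why a canonical (minimum) code is needed.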

DFS Code

DFS Lexicographic Order • Let Z be the set of DFS codes of all graphs. Two DFS codes a and b have the relation a ≤ b (DFS Lexicographic Order in Z) if and only if one of the following conditions is true. Let a = (x0, x1, …, xm) and b = (y0, y1, …, yn),

DFS Code Example • We have γ < β < α

DFS Code Extension • Let a be the minimum DFS code of a graph G and b be a non-minimum DFS code of G. For any graph G’ ⊃ G, we have • minDFS(G’) < b • a is a prefix of minDFS(G’) or minDFS(G’) < a • There is a 1-1 mapping from a graph to its minimum DFS code. • For every graph G’, there exists a G such that G’ ⊃ G and minDFS(G) is a prefix of minDFS(G’)

gSpan Code • Input: a graph database and a support threshold t • Output: all frequent patterns F
gSpan: F1 = {frequent node labels}; K = 1; gSpan_enumeration(F1, K, F)
gSpan_enumeration(FK, K, F):
    K = K + 1
    for each pattern P in FK:
        C = Candidates(P, K)
        F = F ∪ C
        gSpan_enumeration(C, K, F)

How to Propose Candidates • Generate all supergraphs: • Candidate = {G | P ⊂ G, sup(G) >= t, |G| = k} • gSpan method: • Candidate = {G | P ⊂ G, sup(G) >= t, |G| = k, minDFS(P) is a prefix of minDFS(G)} • Right-most expansion

Graph Canonical Code in FFSM • The Canonical Code (θ) maps a graph G to a string. • Code(M1): (1, 1, a)(2, 1, x)(2, 2, b)(3, 1, x)(3, 2, y)(3, 3, b)(4, 2, x)(4, 4, c) • Code(M2): (1, 1, a)(2, 1, x)(2, 2, b)(3, 2, x)(3, 3, c)(4, 1, x)(4, 2, y)(4, 4, b) • Code(M3): (1, 1, a)(2, 2, c)(3, 1, x)(3, 2, x)(3, 3, b)(4, 1, x)(4, 3, y)(4, 4, b) • Code(M1) < Code(M2) < Code(M3) • θ(P) = (1, 1, a)(2, 1, x)(2, 2, b)(3, 1, x)(3, 2, y)(3, 3, b)(4, 2, x)(4, 4, c) • (i, j, Mi,j) ≤ (k, l, Mk,l) if • i < k, or • i = k, j < l, or • i = k, j = l, Mi,j ≤ Mk,l • [Figure: graph P with nodes p1 (a), p2 (b), p3 (b), p4 (c), an isomorphic graph P’ with nodes p’1 to p’4, and three adjacency matrices M1, M2, M3 of P with node labels on the diagonal and edge labels x, y below it]
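The codes on this slide can be reproduced with a short sketch. It reads the lower triangle of an adjacency matrix row by row, skipping empty entries, and takes the smallest code over all vertex orderings, following the ordering shown on this slide (under which θ(P) is the smallest of the three codes). The names `matrix_code` and `canonical_code`, the edge list of P, and the whole-code comparison by first differing entry are assumptions read off the slide, not FFSM's actual implementation, which avoids enumerating all orderings.

```python
from itertools import permutations

def matrix_code(order, node_labels, edge_labels):
    """Code of the adjacency matrix whose rows follow `order` (1-based indices)."""
    code = []
    for i, u in enumerate(order, start=1):
        for j, v in enumerate(order[:i], start=1):
            if i == j:
                code.append((i, j, node_labels[u]))          # diagonal: node label
            elif frozenset((u, v)) in edge_labels:
                code.append((i, j, edge_labels[frozenset((u, v))]))  # edge label
    return code

def canonical_code(node_labels, edge_labels):
    """Smallest code over all vertex orderings (brute force, n! orderings)."""
    return min(matrix_code(list(p), node_labels, edge_labels)
               for p in permutations(node_labels))

# Graph P as read from the slide's codes (an assumption of this sketch).
node_labels = {"p1": "a", "p2": "b", "p3": "b", "p4": "c"}
edge_labels = {frozenset(e): lab for e, lab in
               [(("p1", "p2"), "x"), (("p1", "p3"), "x"),
                (("p2", "p3"), "y"), (("p2", "p4"), "x")]}

m1 = matrix_code(["p1", "p2", "p3", "p4"], node_labels, edge_labels)
m2 = matrix_code(["p1", "p2", "p4", "p3"], node_labels, edge_labels)
m3 = matrix_code(["p1", "p4", "p2", "p3"], node_labels, edge_labels)
theta = canonical_code(node_labels, edge_labels)
print(theta)
```

Python compares tuples element by element, which matches the entry ordering on the slide (row index, then column index, then label).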

The Power of Graph Normalization • [Figure: a 1-1 mapping relates a partial order defined on the graph space to a partial order defined on an arbitrary set]

MoFa (Borgelt and Berthold ICDM’02) • Extend graphs by adding a new edge • Store embeddings of discovered frequent graphs • Fast support calculation • Also used in later algorithms such as FFSM and GASTON • Local structural pruning

GASTON (Nijssen and Kok KDD’04) • Extend graphs directly • Store embeddings • Separate the discovery of different types of graphs • path → tree → graph • Simple structures are easier to mine and duplication detection is much simpler

Graph Pattern Explosion Problem • If a graph is frequent, all of its subgraphs are frequent (the Apriori property) • An n-edge frequent graph may have 2^n subgraphs • Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

Closed Frequent Graphs • Motivation: handling the graph pattern explosion problem • Closed frequent graph • A frequent graph G is closed if there exists no supergraph of G that carries the same support as G • If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) • Lossless compression: still ensures that the mining result is complete
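The closed and maximal definitions can be compared side by side with a small post-filter. This is only a sketch of the definitions, not of CloseGraph's search: `closed_patterns` and `maximal_patterns` are hypothetical names, and patterns are stood in for by frozensets of edge names so that subgraph containment becomes plain set inclusion (real graph patterns would need a subgraph isomorphism test). The supports are invented.

```python
def closed_patterns(support, is_proper_subpattern):
    """Frequent patterns with no proper superpattern of equal support."""
    return {p for p in support
            if not any(is_proper_subpattern(p, q) and support[q] == support[p]
                       for q in support)}

def maximal_patterns(support, is_proper_subpattern):
    """Frequent patterns with no frequent proper superpattern at all."""
    return {p for p in support
            if not any(is_proper_subpattern(p, q) for q in support)}

# Stand-in patterns: sets of edge names; `support` lists frequent patterns only.
e1, e2, e12 = frozenset({"e1"}), frozenset({"e2"}), frozenset({"e1", "e2"})
support = {e1: 3, e2: 2, e12: 2}
proper_subset = lambda p, q: p < q  # set inclusion replaces subgraph testing

closed = closed_patterns(support, proper_subset)
maximal = maximal_patterns(support, proper_subset)
print(len(closed), len(maximal))
```

Here `e2` is dropped from the closed set because `e12` has the same support, while `e1` survives; the maximal set keeps only `e12`. Closed patterns therefore retain all support information, whereas maximal patterns do not.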

CloseGraph (Yan & Han, KDD’03) • A pattern-growth approach • Under what condition can we stop searching a pattern’s children, i.e., terminate early? • Suppose G and G’ are frequent and G is a subgraph of G’. If G’ also occurs wherever G occurs in the graphs of the dataset, then we need not grow G, since none of G’s children will be closed except those of G’. • [Figure: a k-edge graph G and (k+1)-edge graphs G1, G2, …, Gn]

Handling Tricky Exception Cases • [Figure: two dataset graphs (graph 1, graph 2) and two patterns (pattern 1, pattern 2) over node labels a, b, c, d]

Experimental Result • The AIDS antiviral screen compound dataset from NCI/NIH • The dataset contains 43,905 chemical compounds • Among these 43,905 compounds, 423 belong to class CA, 1081 to class CM, and the remainder to class CI

Discovered Patterns • [Figure: example patterns in panels labeled 20%, 10%, and 5%]

Do the Odds Beat the Curse of Complexity? • Potentially exponential number of frequent patterns • The worst-case complexity vs. the expected probability • Ex.: Suppose Walmart has 10^4 kinds of products • The chance to pick up one product: 10^-4 • The chance to pick up a particular set of 10 products: 10^-40 • What is the chance that this particular set of 10 products is frequent, occurring 10^3 times in 10^9 transactions? • Have we solved the NP-hard problem of subgraph isomorphism testing? • No. But the real graphs in biology and chemistry are not so bad • A carbon atom has only 4 bonds, and most proteins in a network have distinct labels
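The Walmart estimate can be written out as a short calculation. It assumes, as the slide implicitly does, that products are picked independently and uniformly at random; the expected-count line is an added step under that same assumption and is not on the slide.

```latex
\[
\Pr[\text{one particular product}] = 10^{-4},
\qquad
\Pr[\text{a particular set of 10 products}] = \left(10^{-4}\right)^{10} = 10^{-40}
\]
\[
\mathbb{E}[\text{transactions containing the set}]
  = 10^{9} \times 10^{-40} = 10^{-31} \ll 10^{3}
\]
```

Under purely random purchasing, a specific 10-product set would essentially never reach 10^3 occurrences, so the sets that do turn out to be frequent in real data reflect structure rather than chance.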

Constrained Patterns • Density • Diameter • Connectivity • Degree • Min, Max, Avg
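The constraints listed above can each be evaluated on a candidate pattern. The sketch below assumes a simple undirected graph given as a node list and an edge list with no duplicates, and it treats "connectivity" as plain connectedness, which is one possible reading of the slide; the function name `constraint_measures` and the example path graph are hypothetical.

```python
from collections import deque

def constraint_measures(nodes, edges):
    """Density, connectedness, diameter, and degree statistics of a simple graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n, m = len(nodes), len(edges)
    degrees = [len(adj[v]) for v in nodes]

    def bfs_distances(src):
        # Hop distances from src to every reachable node.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return dist

    all_dist = [bfs_distances(v) for v in nodes]
    connected = all(len(d) == n for d in all_dist)
    return {
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "connected": connected,
        # Diameter is only defined here for connected graphs.
        "diameter": max(max(d.values()) for d in all_dist) if connected else None,
        "min_degree": min(degrees),
        "max_degree": max(degrees),
        "avg_degree": sum(degrees) / n,
    }

# Hypothetical pattern: a path on four nodes, 1-2-3-4.
measures = constraint_measures([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(measures)
```

A constrained miner would discard, or avoid generating, candidates whose measures fall outside user-specified bounds, for example a minimum density or a maximum diameter.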
