NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs

NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs Mike Yuan

Outline of this presentation • Introduction to PPI • Introduction to Graph Mining • Related work • Problem statement • Details of the NeMoFinder algorithm • Summary • References

Protein Interactions A Protein may interact with: • Other proteins • Nucleic Acids • Small molecules

Finding Protein Partners

Motivation • Important for biological functions • To understand the function of a protein, we need to find its interacting partners

Graph Theory • Vertex (node) • Edge • Directed edge (arc) • Weighted edge • Cycle • Molecular interaction networks are mapped as graphs

The protein protein interaction network…

Graph mining • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Applications: • Graph Indexing • Similarity Search • Classification and Clustering

Why Graph Mining? • Graphs are ubiquitous • Chemical compounds (Cheminformatics) • Protein structures, biological pathways/networks (Bioinformatics) • Program control flow, traffic flow, and workflow analysis • XML databases, Web, and social network analysis • Graph is a general model • Trees, lattices, sequences, and items are degenerate graphs • Complexity of algorithms: many graph problems have high computational complexity

Graph, Graph, Everywhere (from H. Jeong et al., Nature 411, 41 (2001)) • Aspirin • Yeast protein interaction network • Co-author network • Internet

Graph Pattern Mining • Frequent subgraphs • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of graph pattern mining • Mining biochemical structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
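The support definition above can be illustrated with a minimal Python sketch. The three toy graphs, the helper names `contains` and `support`, and the brute-force matching are assumptions made for illustration only; practical miners avoid this exhaustive search.

```python
from itertools import permutations

def contains(graph_edges, n_graph, pattern_edges, n_pattern):
    """True if the pattern occurs as a (not necessarily induced) subgraph."""
    g = {frozenset(e) for e in graph_edges}
    # Try every injective mapping of pattern vertices onto graph vertices.
    for mapping in permutations(range(n_graph), n_pattern):
        if all(frozenset((mapping[u], mapping[v])) in g for u, v in pattern_edges):
            return True
    return False

def support(dataset, pattern_edges, n_pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(edges, n, pattern_edges, n_pattern) for edges, n in dataset)

# Hypothetical dataset: each entry is (edge list, number of vertices).
dataset = [
    ([(0, 1), (1, 2), (2, 0)], 3),          # triangle
    ([(0, 1), (1, 2), (2, 3)], 4),          # path on 4 vertices
    ([(0, 1), (1, 2), (2, 0), (2, 3)], 4),  # triangle with a tail
]
triangle = [(0, 1), (1, 2), (2, 0)]
path3 = [(0, 1), (1, 2)]
MIN_SUPPORT = 2

triangle_support = support(dataset, triangle, 3)
path3_support = support(dataset, path3, 3)
print(triangle_support >= MIN_SUPPORT, path3_support >= MIN_SUPPORT)
```

Both patterns are frequent here: the triangle occurs in two of the three graphs and the 3-vertex path in all three.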

Example: Frequent Subgraphs • Graph dataset: (A), (B), (C) • Frequent patterns (min support is 2): (1), (2)

Frequent Subgraph Mining Approaches • Apriori-based approach: if a graph is frequent, all of its subgraphs are frequent ─ the Apriori property • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • FFSM: Huan, et al. (ICDM’03) • Pattern growth approach • MoFa, Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) • Gaston: Nijssen and Kok (KDD’04)

Problem Statement • PPI network G = (V, E) - each vertex represents a unique protein - each edge between vA and vB indicates an interaction between proteins A and B • Network motif - a frequently occurring subgraph pattern in a network • f(g) is the number of occurrences of a subgraph g in G; g is repeated if f(g) > F, where F is the frequency threshold • f_rand_i(g) is the frequency of g in the randomized network Grand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks • s(g) is the number of randomized networks for which f(g) ≥ f_rand_i(g); g is unique if s(g) > S, where S is the uniqueness threshold • Network motif discovery algorithm
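The two tests in the problem statement can be written out as a short Python sketch. The function names and the numeric values are hypothetical; the strict inequalities follow the definitions on this slide.

```python
def uniqueness(f_g, rand_freqs):
    """s(g): number of randomized networks in which g is no more frequent than in G."""
    return sum(1 for f_rand in rand_freqs if f_g >= f_rand)

def is_network_motif(f_g, rand_freqs, F, S):
    """g is a motif if it is repeated (f(g) > F) and unique (s(g) > S)."""
    return f_g > F and uniqueness(f_g, rand_freqs) > S

# Hypothetical values: g occurs 6 times in G, and N = 5 randomized networks.
f_g = 6
rand_freqs = [2, 7, 3, 6, 1]

s_g = uniqueness(f_g, rand_freqs)
motif = is_network_motif(f_g, rand_freqs, F=2, S=3)
print(s_g, motif)  # 4 True
```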

Problem Statement (cont) • Motivation of NeMoFinder: existing research has the following limitations - The number of network motif candidates increases exponentially with motif size - Interesting network motifs must be both repeated and unique, so Apriori-based algorithms are not directly applicable - Counting occurrences requires graph isomorphism tests, which are computationally expensive (subgraph isomorphism is NP-complete) • NeMoFinder - a network motif discovery algorithm that finds repeated and unique meso-scale network motifs in a large PPI network

Key procedures • Example graph G • Find repeated trees • Use repeated trees to partition a network into a set of graphs • Introduce graph cousins to facilitate the candidate generation and frequency counting processes.

Step 1. Discover repeated subgraphs • Step 1.1: find repeated size-k trees • E.g., size-2 to size-5 trees: t2, t3, t4_1, t4_2, t5_1, t5_2, t5_3

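Step 1. Discover repeated subgraphs (cont) • f(t2) = 7, f(t3) = 13, f(t4_1) = 6, f(t4_2) = 17, f(t5_1) = 1, f(t5_2) = 5, f(t5_3) = 7 • Repeated tree sets: T2 = {t2}, T3 = {t3}, T4 = {t4_1, t4_2} and T5 = {t5_2, t5_3} • t5_1 occurs only once, so it is not repeated and is left out of T5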

Step 1.2 Use repeated size-k trees to partition graph • Occurrences of t4_1 in G.

Step 1.2 Use repeated size-k trees to partition graph (cont) • Occurrences of t4_2 in G.

Step 1.2 Use repeated size-k trees to partition graph (cont) • Set of graphs GD4 = {G4_1, G4_2, G4_3, G4_4, G4_5}

Step 1.3: perform graph join operation to find repeated size-k graphs • Generate 3-edge subgraphs from the size-4 trees - from t4_1: h1, h2 - from t4_2: h3, h4, h5

Step 1.3: perform graph join operation to find repeated size-k graphs (cont) • Examples of graph join operations - join(t4_1, h2) = g1_2 - join(t4_2, h3) = g1_1 • f(g1_1) = 2 and f(g1_2) = 5

Step 1.3: perform graph join operation to find repeated size-k graphs (cont) • Use the subgraphs obtained so far to generate further subgraphs - from g1_2: h6, h7 • Graph join operation: join(g1_2, h6) = g2 • f(g2) < 2, so g2 is not repeated and the algorithm stops

Algorithm 1 NeMoFinder
Input: G - PPI network; N - number of randomized networks; K - maximal network motif size; F - frequency threshold; S - uniqueness threshold
Output: U - repeated and unique network motif set
Step 1: Discover repeated subgraphs
D ← ∅
for motif-size k from 3 to K do
    T ← FindRepeatedTrees(k)    (Step 1.1: find repeated size-k trees)
    GDk ← GraphPartition(G, T)    (Step 1.2: use repeated size-k trees to partition graph)
    D ← D ∪ T
    D’ ← T
    i ← k
    while D’ ≠ ∅ and i ≤ k × (k − 1)/2 do
        D’ ← FindRepeatedGraphs(k, i, D’)    (Step 1.3: perform graph join operation to find repeated size-k graphs)
        D ← D ∪ D’
        i ← i + 1
    end while
end for

Algorithm 1 NeMoFinder (cont)
Step 2: Determine subgraph frequency in randomized networks
for counter i from 1 to N do
    Grand ← RandomizedNetworkGeneration()
    for each g ∈ D do
        GetRandFrequency(g, Grand)
    end for
end for
Step 3: Compute uniqueness of subgraphs
U ← ∅
for each g ∈ D do
    s ← GetUniqunessValue(g)
    if s ≥ S then
        U ← U ∪ {g}
    end if
end for
return U

Algorithm Steps (cont) • Step 2: Determine subgraph frequency in randomized networks - Generate randomized networks Grand_i (1 ≤ i ≤ N) - Check the frequency of the subgraphs in each of the randomized networks Grand_i • Step 3: Compute uniqueness of subgraphs - Based on frequencies in the input PPI network and the randomized networks - f_rand_i(g) is the frequency of g in the randomized network Grand_i, for 1 ≤ i ≤ N, where N is the number of randomized networks - s(g) is the number of randomized networks for which f(g) ≥ f_rand_i(g); g is unique if s(g) > S
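The slides do not say how the randomized networks are generated. One common choice in motif analysis is degree-preserving edge swapping, sketched below in Python as an assumption rather than as NeMoFinder's own procedure; the function name `randomize_network`, the toy edge list, and the swap budget are all hypothetical.

```python
import random

def randomize_network(edges, n_swaps, seed=0):
    """Rewire an undirected simple graph while keeping every vertex degree fixed."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible rewirings at random
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would create a self-loop or duplicate
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:
            continue  # reject swaps that would create a multi-edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
        done += 1
    return edge_list

def degrees(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Hypothetical toy network.
G = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
G_rand = randomize_network(G, n_swaps=10, seed=1)
print(degrees(G) == degrees(G_rand))
```

Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), so every vertex keeps its degree while the wiring changes.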

Find repeated subgraphs
Algorithm 2 FindRepeatedGraphs(k, i, D’)
Input: D’ - set of repeated subgraphs with k vertices and i − 1 edges
Output: D’’ - set of repeated subgraphs with k vertices and i edges
C ← CandidateGeneration(k, i, D’)
D’’ ← FrequencyCounting(k, i, C)
return D’’

Candidate generation using graph cousins • Represent subgraphs by adjacency matrices • Code(M): the sequence formed by concatenating the lower triangular entries of M in the order m(1,1) m(2,1) m(2,2) … m(n,1) m(n,2) … m(n,n) • Transform the adjacency matrix into the canonical adjacency matrix (CAM), which is the one with the maximal code • Definition of subCAM of a graph - the matrix obtained by setting the last edge entry in CAM(g) to 0
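The code, CAM, and subCAM definitions above can be sketched in Python as follows. This is a brute-force illustration that tries every vertex ordering, so it is only feasible for very small graphs; the function names are hypothetical, and the graphs are assumed unlabeled with a zero diagonal.

```python
from itertools import permutations

def code(M):
    """Concatenate the lower-triangular entries row by row, diagonal included."""
    n = len(M)
    return "".join(str(M[i][j]) for i in range(n) for j in range(i + 1))

def cam(M):
    """Canonical adjacency matrix: the vertex ordering whose code is maximal."""
    n = len(M)
    best = None
    for p in permutations(range(n)):
        P = [[M[p[i]][p[j]] for j in range(n)] for i in range(n)]
        # Codes have equal length, so string comparison orders them correctly.
        if best is None or code(P) > code(best):
            best = P
    return best

def sub_cam(C):
    """Copy of C with its last edge entry (in code order) set to 0."""
    n = len(C)
    S = [row[:] for row in C]
    for i in range(n - 1, -1, -1):
        for j in range(i - 1, -1, -1):
            if S[i][j]:
                S[i][j] = S[j][i] = 0
                return S
    return S

# Two labelings of the same 3-vertex path (center at vertex 1, then at vertex 2).
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

cam_code = code(cam(path_a))
sub_code = code(sub_cam(cam(path_a)))
print(cam_code, sub_code)  # 010100 010000
```

Because isomorphic graphs share the same CAM, both labelings of the path produce the same code, which is what makes the CAM usable as a canonical form.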

Candidate generation using graph cousins (cont) • Definition of cousin - Given two subgraphs g and h, if subCAM(g) = subCAM(h), then h is a cousin of g • Three types of cousin relationship between g and h - Type I (direct cousin): h is isomorphic to a subgraph g’ which has the same number of vertices and edges as g, and g’ ≠ g - Type II (twin cousin): h is isomorphic to the subgraph g - Type III (distant cousin): h is a disconnected subgraph

Candidate generation using graph cousins (cont) • Adjacency matrices for the graphs in figure 6: t4_1, h1, h2 [Figure: adjacency matrices]

Candidate generation using graph cousins (cont) • Adjacency matrices for the graphs in figure 6: t4_2, h3, h4, h5 [Figure: adjacency matrices]

Candidate generation using graph cousins (cont) • Observations from the above example - h1 is a Type I direct cousin of t4_1 - h2 is a Type III distant cousin of t4_1 - h3 is a Type II twin cousin of t4_2 - h4 is a Type I direct cousin of t4_2 - h5 is a Type III distant cousin of t4_2

Candidate generation using graph cousins (cont)
Algorithm 3 CandidateGeneration(k, i, D’)
Input: D’ - set of repeated subgraphs with k vertices and i − 1 edges
Output: C - set of candidates with k vertices and i edges
C ← ∅
for each g ∈ D’ do
    H ← GetCousin(g)    (Step 1: find the set of cousins)
    for each h ∈ H do
        g’ ← join(g, h)    (Step 2: join g with each cousin to form a new subgraph)
        C ← C ∪ {g’}
    end for
end for
return C

Frequency counting • Leverage the properties of the different types of cousins - L(x): set of graphs in GDk embedding x - g’ is the subgraph obtained by joining g and h - If h is a Type I direct cousin of g, then L(g’) = L(g) ∩ L(h) and f(g’) = |L(g) ∩ L(h)| - If h is a Type III distant cousin, then f(g’) = |L(g) ∩ L(h)| - If h is a Type II twin cousin, then f(g’) = CheckAllOccurances(g) • Example: L(t4_1) = {G4_1, G4_2, G4_3, G4_5} and L(h2) = {G4_1, G4_2, G4_3, G4_4, G4_5}, so L(g1_2) = L(t4_1) ∩ L(h2) = {G4_1, G4_2, G4_3, G4_5} and f(g1_2) = 4 > 2

Frequency counting
Algorithm 4 FrequencyCounting(k, i, C)
Input: GDk - set of graphs generated by partitioning G with size-k repeated trees; C - set of subgraph candidates with k vertices and i edges; F - frequency threshold
Output: D’’ - set of repeated subgraphs with k vertices and i edges
D’’ ← ∅
for each g’ ∈ C do
    Get the join parameters of g’: g and h
    Lg ← set of graphs in GDk embedding g
    Lh ← set of graphs in GDk embedding h
    if fg < F or fh < F then
        fg’ ← 0
    else if type of h = Type I direct cousin then    (case: h is a direct cousin)
        fg’ ← |Lg ∩ Lh|
    else if type of h = Type III distant cousin then    (case: h is a distant cousin)
        fg’ ← |Lg ∩ Lh|
    else if type of h = Type II twin cousin then    (case: h is a twin cousin)
        fg’ ← CheckAllOccurances(g)
    end if
    if fg’ > F then
        D’’ ← D’’ ∪ {g’}
    end if
end for
return D’’
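Algorithm 4 can be sketched in Python using plain sets for the embedding lists. The data layout (dictionaries keyed by subgraph name), the function signature, and the value f(h2) = 5 are assumptions for illustration; the embedding sets and f(t4_1) = 6 are taken from the earlier slides, and the twin-cousin case is delegated to a caller-supplied function because the slides do not describe how `CheckAllOccurances` works.

```python
def frequency_counting(candidates, embeddings, freq, F, check_all_occurrences):
    """Return {candidate name: frequency} for candidates with frequency > F."""
    repeated = {}
    for g_new, (g, h, cousin_type) in candidates.items():
        L_g, L_h = embeddings[g], embeddings[h]
        if freq[g] < F or freq[h] < F:
            f_new = 0  # a parent is not repeated, so the join cannot be either
        elif cousin_type in ("direct", "distant"):
            f_new = len(L_g & L_h)  # Type I and Type III: intersect embedding lists
        elif cousin_type == "twin":
            f_new = check_all_occurrences(g)  # Type II: explicit recount needed
        else:
            raise ValueError(f"unknown cousin type: {cousin_type}")
        if f_new > F:
            repeated[g_new] = f_new
    return repeated

# Embedding sets from the frequency counting example.
embeddings = {
    "t4_1": {"G4_1", "G4_2", "G4_3", "G4_5"},
    "h2": {"G4_1", "G4_2", "G4_3", "G4_4", "G4_5"},
}
freq = {"t4_1": 6, "h2": 5}  # f(h2) is a hypothetical value
candidates = {"g1_2": ("t4_1", "h2", "distant")}

result = frequency_counting(candidates, embeddings, freq, F=2,
                            check_all_occurrences=lambda g: 0)
print(result)  # {'g1_2': 4}
```

The intersection shortcut is what lets the direct and distant cousin cases skip a full occurrence search over the network.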

Summary • NeMoFinder: an efficient network motif discovery algorithm that finds larger repeated and unique network motifs in PPI networks • Uses repeated trees to partition the network into a set of graphs • Uses graph cousins for candidate generation and frequency counting

References (1) • T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02 • C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02 • D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05 • J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng, “NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs”, KDD'06 • M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM'03 • M. Deshpande, M. Kuramochi, and G. Karypis, “Automated approaches for classifying structures”, BIOKDD'02 • C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of Connection Subgraphs”, KDD'04 • H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05

References (2) • L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue system”, KDD'94 • J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. “Mining spatial motifs from protein structure graphs”, RECOMB’04 • J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03 • H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05 • A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00 • C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight Version 4.82”. Daylight Chemical Information Systems, Inc., 2003 • G. Jeh and J. Widom, “Mining the Space of Graph Properties”, KDD'04 • H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03

References (3) • M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004 • T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04 • M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01 • M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04 • C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05 • P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04 • S. Nijssen and J. Kok. “A quickstart in frequent structure mining can make a difference”, KDD'04 • J. Prins, J. Yang, J. Huan, and W. Wang. “Spin: Mining maximal frequent subgraphs from graph databases”, KDD'04

References (4) • D. Shasha, J. T.-L. Wang, and R. Giugno. “Algorithmics and applications of tree and graph searching”, PODS'02 • J. R. Ullmann. “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42, 1976 • N. Vanetik, E. Gudes, and S. E. Shimony. “Computing frequent graph patterns from semistructured data”, ICDM'02 • C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “Scalable mining of large disk-based graph databases”, KDD'04 • T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003 • X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02 • X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03 • X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04 • X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05 • X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05 • X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06 • M. J. Zaki. “Efficiently mining frequent trees in a forest”, KDD'02
