410 likes | 531 Views
Seminar in Bioinformatics. An efficient algorithm for detecting frequent subgraphs in biological networks. Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207. Presented by: Royi Ronen. Abstract. Motivation
E N D
Seminar in Bioinformatics An efficient algorithm for detecting frequent subgraphs in biological networks Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207. Presented by: Royi Ronen
Abstract • Motivation • Network interaction data is abundant • Analyzing this data is important • Problems are close to the subgraph isomorphism problem – Hard! • Results • An efficient algorithm for detecting frequently occurring patterns in bio-network • The algorithm simplifies the subgraph isomorphism problem to a different, tractable, problem with biological applications • Mining the KEGG database yields positive empiric results
Outline • Introduction • Model • Approach: Graph Mining • Related Work • Formalism for metabolic pathways • The Algorithm • Discussion and Empiric Results • Conclusion • Future Work
Introduction • Experimental data relating to biological sequences (that are highly available and accessible) play an important role in tasks such as discovering common sequences and motifs • Biomolecular interaction data are abstracted as graphs • Example: A hypergraph can represent a metabolic pathway where nodes represent compounds • Can be reduced to a directed graph where nodes are enzymes and edges relate them
Introduction • Key problems in this context: • Aligning multiple graphs • Finding frequently occurring sub-graphs in a collection of graphs • A solution can lead to the understanding of • Motifs of cellular interactions • Evolutionary relationships • Differences between networks in different organisms • Patterns of gene regulation
Introduction • In the paper • Finding frequently occurring subgraphs in a collection of graphs, each representing a metabolic pathway • Close to the NP-Hard subgraph isomorphism problem • End of story? • No! • The problem can be simplified and made tractable and still capture the biological information • Nodes will be “uniquely labeled”, according to the represented enzyme • Experimental results: discovering “interesting” patterns from KEGG takes seconds
Outline • Introduction ☺ • Model • Approach: Graph Mining • Related Work • Formalism for metabolic pathways • The Algorithm • Discussion and Empiric Results • Conclusion • Future Work
Metabolic Pathways • Oldest kind of biological network • Group the reactions that belong to a process • Publicly available (e.g., KEGG) • Chemical compounds are linked to each other by a product-substrate relationship • In a hypergraph • Nodes are compounds • A hyperedge is a reaction (or an enzyme) • Hyperedge direction is important to distinguish between substrates and products a c b
Metabolic Pathways • Simplification: • Regular graph, nodes represent enzymes, an edge connects enzyme a to enzyme b iff a’s product is b’s substrate (more accurately, if such a relation exists) • Edges may be labeled by the compound that relates a to b. • A specific enzyme may appear more than once in the same pathway, but we consider merged nodes at the price of losing temporal information • Various problems related to understanding the molecular interaction in the cell can be solved using graph related frameworks, mostly to provide a means to investigate units with well defined functionality • Paper focus: Mining pathways for frequent connected subgraphs, which is important because functional modules are expected to repeat among several pathways or organisms (or both) com. a b
Outline • Introduction ☺ • Model ☺ • Approach: Graph Mining • Related Work • Formalism for metabolic pathways • The Algorithm • Discussion and Empiric Results • Conclusion • Future Work
Related Work • Subgraph isomorphism • Unlabeled version. Hardness usually “tackled” by ordering nodes and edges for efficient processing • Labeled Version. Easier, suitable for biological networks • Frequent itemset mining • Multiple sets of items (transactions) from domain D are given • Itemset X implies itemset Y with c confidence if c% of sets containing X also contain Y • X→Y has support s if s% of the sets contain X and Y
Graph Formalism for Metabolic Pathways • A Metabolic Pathway is a triplet, P(M,Z,R) • M, a set of metabolites • Z, a set of enzymes • R, a set of reactions, where each reaction r is associated with • A set of enzymes Z(r) from Z • A set of substrates S(r) from M • A set of products T(r) from M enzyme metabolite
Graph Formalism for Metabolic Pathways • A Graph G(V,E) for P(M,Z,R) is defined • For every enzyme zi in Z - a node vi exists • (vi,vj) in E iff zj consumes the product of zi • Example: enzyme metabolite enzyme
Mining Metabolic Pathways • The Problem: Given a collection of n graphs and a support threshold ε, find all maximal connected subgraphs that are contained in at least εn of the graphs • The support of a subgraph which appears in n’ graphs is n’/n. • A frequent subgraph is maximal if it is not contained by another frequent subgraph
Subgraph Isomorphism Simplified • Nodes are labeled by enzyme identifiers • Only edges are needed to define a graph. Their labels conceptually identify the nodes • Edges are items, uniquely specified by labels which refer to enzymes • The problem can therefore be reduced to mining frequent itemset • The graph G1 here is {ab,ac,de} • Connectivity has to be considered
Subgraph Homeomorphism Simplified • A connected edgeset corresponds to a connected subgraph • A unique edge is a set of two node labels • A set of unique edges ES={e1, e2 …, ek} is called connected iff every subset ES’ of ES shares at least one node with the remaining edges ES\ES’. • Connection to frequent itemset mining • Input Graphs correspond to transactions • Connected edgesets correspond to itemsets • Approach: build frequent sets bottom up (small to large) • Edge addition preserves connectivity
Subgraph Homeomorphism Simplified • Through the search, only connected edgesets are considered • Captures the connected nature of pathways • Avoiding redundancy coming from considering the same sets in different order is important.
The Algorithm • The procedure is invoked for each frequent edge ei – Mine({}, {ei}, N(ei), {e1,e2,…,ek}) • The support is embodied in the “if frequent” statement • Example: consider 5 enzymes, a, b, c, d and e, which participate (vacuously or not) in 4 pathways G1,G2,G3,G4. • We mine with support = ¾.
Example ab, ac and de are the only frequent edges Mine({}, {ab}, N(ab), {ab,ac,bd,de,ce} Mine({}, {ac}, N(ac), {ab,ac,bd,de,ce} Mine({}, {de}, N(de), {ab,ac,bd,de,ce} {ab,ac},{de} are the frequent subgraphs
Example Mining development: {ab,ac},{de} are the frequent maximal subgraphs
Polynomial Bound • The paper does not prove complexity, but only justifies “efficiency” in an empiric way • We show a polynomial bound for time complexity • Determining which are the frequent edges can be done using sorting • Determining the neighbors of an edge is linear (requires one pass) • In every level of the recursion, the algorithm extends a frequent subgraph with a new frequent edge. This is a linear number of procedures • Each such procedure can be done in polynomial time complexity, where n is the number of edges in the input
Outline • Introduction ☺ • Model: ☺ • Approach: Graph Mining ☺ • Related Work ☺ • Formalism for metabolic pathways ☺ • The Algorithm ☺ • Discussion and Empiric Results • Conclusion • Future Work
Empiric Results • The bold subgraph was mined and appears in 29% of the organisms in KEGG • The solid subgraph appears in 19.3% • The entire graph appears in 14.2% Glutamate
Empiric Results Alanine-aspartate Pyrimidine 32.1%, 19.2%, 11.5% 25.6%, 21.8%, 15.4%
Empiric Results • Run time results for Pentium 4, 2 GHz, 0.5 GB of RAM • Sub pathway of 16 edges discovered in 3 sec. • The entire graph appears in 14.2%
Outline • Introduction ☺ • Model: ☺ • Approach: Graph Mining ☺ • Related Work ☺ • Formalism for metabolic pathways ☺ • The Algorithm ☺ • Discussion and Empiric Results ☺ • Conclusion • Future Work
Conclusion • Framework for mining biological networks • Graph simplification without losing biological meaning • Efficient graph mining • Good response times
Outline • Introduction ☺ • Model: ☺ • Graph Mining ☺ • Related Work ☺ • Formalism for metabolic pathways ☺ • The Algorithm ☺ • Discussion and Empiric Results ☺ • Conclusion ☺ • Future Work
Future Work • Adding flexibility for capturing biologically meaningful info and concepts, such as probabilistic methods • Probabilistic models for investigating the significance of discovered patterns (but unlike the previous case, probability does not model biology) • Approximate matching rather than exact • What is an approximation in this case? Suitable definition needed
Seminar in Bioinformatics Pairwise Local Alignment of Protein Interaction Networks Guided by Models of Evolution Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Journal of Comp. Biology, 13(2), 182-199, 2006. Presented by: Royi Ronen
The Problem • Protein-Protein-Interaction networks are modeled as graphs • A PPI network is an undirected graph (V,E) • Elements in V represent proteins • Elements in E represent pairs which interact • The paper solves the problem of aligning two graphs (rather than many)
Homology Function S(•,•) • Consider two Graphs: G(U,E), H(V,F) • For each pair from the union of V and U, S assigns a score: • If the pair belongs to the same (a different) species, the confidence that they are paralogous (orthologous). 0 is the lowest value • Values of S are determined by an algorithm out of the scope of the paper (INPARANOID) • Some definitions: • Match: A conserved interaction between orthologous pairs • Mismatch: A lack of interaction between a pair whose orthologs interact • Duplication: Paralogous proteins (tend to diverge in the long run)
Proposed Solution • Every pair of node subsets induces an alignment {M,N,D} which is associated with a score • M - Pairs of edges, with positive S values to nodes, which exist in both graphs. Each associated with a positive score • N - Pairs of edges, with positive S values to nodes, which exist in one graph but not in the other. Each associated with a negative score • D - Pairs of nodes from the same graph with positive S. Each associated with a negative score • The total score is the sum of all the scores, and we wish to find alignment with locally maximal scores
Proposed Solution • An algorithm is proposed in order to avoid considering all possible subsets • The heuristics tries to expand a set so that its scores is made higher • Rings a bell?
Experimental Results • Using this alignment method and a scoring algorithm for S(•,•) called INPARANOID, PPI networks of Human and Mouse were aligned • Data taken from the DIP Database • Details: • Homo Sapiens - 1369 interaction between 1065 proteins • Mus Musculus – 286 interactions between 329 proteins
Experimental Results • INPARANOID discovered 237 ortholog clusters • 305 matched interactions were discovered; 205 mismatches, 536 duplications in Human; 149 mismatches, 384 duplications in Mouse. • Examples: • Conserved subnet with one-way mismatches • Conserved subnet with two-way mismatches • Duplications
Example 1 • Graphs aligned • Biological meaning • Similarity and differences between the species • Insight on evolutionary events
Example 2 • Another graph alignment result with local maximum score
Example 3 • Instance of duplication between mouse and human • The regulator regulates homologs