Seminar in Bioinformatics

Seminar in Bioinformatics An efficient algorithm for detecting frequent subgraphs in biological networks Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207. Presented by: Royi Ronen

Abstract • Motivation • Network interaction data is abundant • Analyzing this data is important • Problems are close to the subgraph isomorphism problem – Hard! • Results • An efficient algorithm for detecting frequently occurring patterns in biological networks • The algorithm simplifies the subgraph isomorphism problem to a different, tractable problem with biological applications • Mining the KEGG database yields positive empiric results

Outline • Introduction • Model • Approach: Graph Mining • Related Work • Formalism for metabolic pathways • The Algorithm • Discussion and Empiric Results • Conclusion • Future Work

Introduction • Experimental data relating to biological sequences (that are highly available and accessible) play an important role in tasks such as discovering common sequences and motifs • Biomolecular interaction data are abstracted as graphs • Example: A hypergraph can represent a metabolic pathway where nodes represent compounds • Can be reduced to a directed graph where nodes are enzymes and edges relate them

Introduction • Key problems in this context: • Aligning multiple graphs • Finding frequently occurring sub-graphs in a collection of graphs • A solution can lead to the understanding of • Motifs of cellular interactions • Evolutionary relationships • Differences between networks in different organisms • Patterns of gene regulation

Introduction • In the paper • Finding frequently occurring subgraphs in a collection of graphs, each representing a metabolic pathway • Close to the NP-Hard subgraph isomorphism problem • End of story? • No! • The problem can be simplified and made tractable and still capture the biological information • Nodes will be “uniquely labeled”, according to the represented enzyme • Experimental results: discovering “interesting” patterns from KEGG takes seconds

Outline • Introduction ☺ • Model • Approach: Graph Mining • Related Work • Formalism for metabolic pathways • The Algorithm • Discussion and Empiric Results • Conclusion • Future Work

Metabolic Pathways • Oldest kind of biological network • Group the reactions that belong to a process • Publicly available (e.g., KEGG) • Chemical compounds are linked to each other by a product-substrate relationship • In a hypergraph • Nodes are compounds • A hyperedge is a reaction (or an enzyme) • Hyperedge direction is important to distinguish between substrates and products

Metabolic Pathways • Simplification: • Ordinary directed graph: nodes represent enzymes, and an edge connects enzyme a to enzyme b iff some product of a is a substrate of b • Edges may be labeled by the compound that relates a to b • A specific enzyme may appear more than once in the same pathway; such nodes are merged, at the price of losing temporal information • Various problems related to understanding molecular interaction in the cell can be solved using graph-related frameworks, mostly to provide a means of investigating units with well-defined functionality • Paper focus: Mining pathways for frequent connected subgraphs, which is important because functional modules are expected to repeat among several pathways or organisms (or both)

Outline • Introduction ☺ • Model ☺ • Approach: Graph Mining • Related Work • Formalism for metabolic pathways • The Algorithm • Discussion and Empiric Results • Conclusion • Future Work

Related Work • Subgraph isomorphism • Unlabeled version. Hardness usually “tackled” by ordering nodes and edges for efficient processing • Labeled Version. Easier, suitable for biological networks • Frequent itemset mining • Multiple sets of items (transactions) from domain D are given • Itemset X implies itemset Y with c confidence if c% of sets containing X also contain Y • X→Y has support s if s% of the sets contain X and Y
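The support and confidence definitions above can be illustrated with a minimal Python sketch. The toy transactions and the function names `support` and `confidence` are assumptions made for illustration; they do not come from the paper. Values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y <= t) / len(with_x)


# Hypothetical toy transactions, not taken from the paper.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]

print(support(transactions, {"a", "b"}))       # support of the rule a -> b
print(confidence(transactions, {"a"}, {"b"}))  # confidence of the rule a -> b
```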

Graph Formalism for Metabolic Pathways • A Metabolic Pathway is a triplet, P(M,Z,R) • M, a set of metabolites • Z, a set of enzymes • R, a set of reactions, where each reaction r is associated with • A set of enzymes Z(r) from Z • A set of substrates S(r) from M • A set of products T(r) from M

Graph Formalism for Metabolic Pathways • A Graph G(V,E) for P(M,Z,R) is defined • For every enzyme zi in Z, a node vi exists • (vi,vj) in E iff zj consumes the product of zi • Example: [Figure; legend: enzyme, metabolite]
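As a rough sketch of the construction above, the following Python builds the directed enzyme graph from a list of reactions. The dictionary encoding of Z(r), S(r) and T(r), the function name `enzyme_graph`, and the three toy reactions are all hypothetical; the slides do not prescribe a data layout.

```python
from itertools import product


def enzyme_graph(reactions):
    """Return directed edges (zi, zj) where zj consumes a product of zi.

    Each reaction is a dict with keys 'enzymes', 'substrates', 'products',
    a hypothetical encoding of Z(r), S(r) and T(r).
    """
    edges = set()
    for r1, r2 in product(reactions, repeat=2):
        # r2 consumes at least one metabolite that r1 produces
        if r1["products"] & r2["substrates"]:
            for zi, zj in product(r1["enzymes"], r2["enzymes"]):
                edges.add((zi, zj))
    return edges


# Hypothetical chain: z1 produces m2, z2 turns m2 into m3, z3 consumes m3.
reactions = [
    {"enzymes": {"z1"}, "substrates": {"m1"}, "products": {"m2"}},
    {"enzymes": {"z2"}, "substrates": {"m2"}, "products": {"m3"}},
    {"enzymes": {"z3"}, "substrates": {"m3"}, "products": {"m4"}},
]

print(sorted(enzyme_graph(reactions)))
```

Because nodes are keyed by enzyme identifier, an enzyme that occurs in several reactions is merged into a single node, matching the simplification described earlier.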

Mining Metabolic Pathways • The Problem: Given a collection of n graphs and a support threshold ε, find all maximal connected subgraphs that are contained in at least εn of the graphs • The support of a subgraph that appears in n’ of the graphs is n’/n • A subgraph is frequent if its support is at least ε • A frequent subgraph is maximal if it is not contained in any other frequent subgraph

Subgraph Isomorphism Simplified • Nodes are labeled by enzyme identifiers • Only edges are needed to define a graph. Their labels conceptually identify the nodes • Edges are items, uniquely specified by labels which refer to enzymes • The problem can therefore be reduced to frequent itemset mining • The graph G1 here is {ab,ac,de} • Connectivity has to be considered
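With uniquely labeled nodes, a graph is just a set of edge labels, and checking whether a subgraph occurs in a graph reduces to set containment. The sketch below illustrates this under stated assumptions: only G1 = {ab,ac,de} is taken from the slide, while G2, G3, G4 and the function names are hypothetical.

```python
from collections import Counter

# Each graph is a set of edge labels. Only G1 is given on the slide;
# G2..G4 are hypothetical graphs invented for this illustration.
graphs = {
    "G1": {"ab", "ac", "de"},
    "G2": {"ab", "ac", "de", "bd"},
    "G3": {"ab", "ac", "ce"},
    "G4": {"de", "bd", "ce"},
}


def subgraph_support(graphs, edgeset):
    """Fraction of graphs that contain every edge of `edgeset`."""
    edgeset = set(edgeset)
    return sum(1 for g in graphs.values() if edgeset <= g) / len(graphs)


def frequent_edges(graphs, threshold):
    """Edges whose individual support is at least `threshold`."""
    counts = Counter(e for g in graphs.values() for e in g)
    return {e for e, c in counts.items() if c / len(graphs) >= threshold}


print(sorted(frequent_edges(graphs, 0.75)))
print(subgraph_support(graphs, {"ab", "ac"}))
```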

Subgraph Isomorphism Simplified • A connected edgeset corresponds to a connected subgraph • A unique edge is a set of two node labels • A set of unique edges ES={e1, e2, …, ek} is called connected iff every nonempty proper subset ES’ of ES shares at least one node with the remaining edges ES\ES’ • Connection to frequent itemset mining • Input graphs correspond to transactions • Connected edgesets correspond to itemsets • Approach: build frequent sets bottom up (small to large) • Only edges that share a node with the current set are added, so connectivity is preserved
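The connectivity condition on edgesets can be checked by growing a set of reached nodes, as in the sketch below. This is an illustrative implementation, not the paper's; the name `is_connected` and the choice to treat the empty edgeset as not connected are assumptions.

```python
def is_connected(edgeset):
    """True if the edges form a single component linked by shared nodes.

    Each edge is any two-element iterable of node labels, e.g. "ab".
    The empty edgeset is treated as not connected (an arbitrary choice).
    """
    edges = [frozenset(e) for e in edgeset]
    if not edges:
        return False
    reached = set(edges[0])   # node labels reached so far
    remaining = edges[1:]     # edges not yet attached
    grew = True
    while remaining and grew:
        grew = False
        for e in list(remaining):
            if e & reached:   # edge shares a node with the reached part
                reached |= e
                remaining.remove(e)
                grew = True
    return not remaining


print(is_connected(["ab", "ac"]))        # both edges share node a
print(is_connected(["ab", "ac", "de"]))  # de shares no node with the rest
```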

Subgraph Isomorphism Simplified • Throughout the search, only connected edgesets are considered • This captures the connected nature of pathways • It is important to avoid the redundancy that comes from considering the same set in different orders

The Algorithm

The Algorithm • The procedure is invoked for each frequent edge ei – Mine({}, {ei}, N(ei), {e1,e2,…,ek}) • The support threshold is enforced by the “if frequent” statement • Example: consider 5 enzymes, a, b, c, d and e, which participate (vacuously or not) in 4 pathways G1,G2,G3,G4 • We mine with support threshold ¾

Example • ab, ac and de are the only frequent edges • Mine({}, {ab}, N(ab), {ab,ac,bd,de,ce}) • Mine({}, {ac}, N(ac), {ab,ac,bd,de,ce}) • Mine({}, {de}, N(de), {ab,ac,bd,de,ce}) • {ab,ac} and {de} are the frequent subgraphs

Example Mining development: {ab,ac},{de} are the frequent maximal subgraphs
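The bottom-up search can be sketched in Python as follows. This is a simplified stand-in for the Mine procedure, whose pseudocode is not reproduced in this transcript: it grows connected edgesets one frequent edge at a time and uses a `found` set to avoid revisiting the same edgeset in a different order. The four input graphs are hypothetical (only G1 matches the slide) and were chosen so that ab, ac and de are the only frequent edges at threshold ¾.

```python
def share_node(edge, edgeset):
    """True if `edge` has a node label in common with some edge in `edgeset`."""
    return any(set(edge) & set(f) for f in edgeset)


def mine_frequent_connected(graphs, threshold):
    """Return the maximal frequent connected edgesets (as frozensets)."""
    n = len(graphs)

    def frequent(edgeset):
        return sum(1 for g in graphs if edgeset <= g) / n >= threshold

    all_edges = set().union(*graphs)
    freq_edges = sorted(e for e in all_edges if frequent(frozenset([e])))

    found = set()                                    # frequent connected edgesets
    stack = [frozenset([e]) for e in freq_edges]
    while stack:
        current = stack.pop()
        if current in found:                         # same set, different order
            continue
        found.add(current)
        for e in freq_edges:
            # extend only with neighboring edges, so connectivity is preserved
            if e not in current and share_node(e, current):
                bigger = current | {e}
                if bigger not in found and frequent(bigger):
                    stack.append(bigger)
    # keep only edgesets not strictly contained in another frequent edgeset
    return {s for s in found if not any(s < t for t in found)}


# Hypothetical input: only the first graph is taken from the slides.
graphs = [
    {"ab", "ac", "de"},
    {"ab", "ac", "de", "bd"},
    {"ab", "ac", "ce"},
    {"de", "bd", "ce"},
]

for subgraph in sorted(mine_frequent_connected(graphs, 0.75), key=sorted):
    print(sorted(subgraph))
```

On this toy input the sketch reports {ab,ac} and {de}, in agreement with the example. It stores every frequent connected edgeset it visits, which keeps the code short but makes no attempt to match the paper's bookkeeping.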

Polynomial Bound • The paper does not prove a complexity bound; it justifies “efficiency” only empirically • We show a polynomial bound on the time complexity • The frequent edges can be determined by sorting • The neighbors of an edge can be determined in linear time (one pass) • At every level of the recursion, the algorithm extends a frequent subgraph with a new frequent edge, which gives a linear number of procedure calls • Each such call takes time polynomial in n, the number of edges in the input

Outline • Introduction ☺ • Model ☺ • Approach: Graph Mining ☺ • Related Work ☺ • Formalism for metabolic pathways ☺ • The Algorithm ☺ • Discussion and Empiric Results • Conclusion • Future Work

Empiric Results • Pathway: Glutamate • The bold subgraph was mined and appears in 29% of the organisms in KEGG • The solid subgraph appears in 19.3% • The entire graph appears in 14.2%

Empiric Results • Alanine-aspartate: 32.1%, 19.2%, 11.5% • Pyrimidine: 25.6%, 21.8%, 15.4%

Empiric Results • Run-time results on a Pentium 4, 2 GHz, 0.5 GB of RAM • A sub-pathway of 16 edges was discovered in 3 sec.

Outline • Introduction ☺ • Model ☺ • Approach: Graph Mining ☺ • Related Work ☺ • Formalism for metabolic pathways ☺ • The Algorithm ☺ • Discussion and Empiric Results ☺ • Conclusion • Future Work

Conclusion • Framework for mining biological networks • Graph simplification without losing biological meaning • Efficient graph mining • Good response times

Outline • Introduction ☺ • Model ☺ • Approach: Graph Mining ☺ • Related Work ☺ • Formalism for metabolic pathways ☺ • The Algorithm ☺ • Discussion and Empiric Results ☺ • Conclusion ☺ • Future Work

Future Work • Adding flexibility for capturing biologically meaningful info and concepts, such as probabilistic methods • Probabilistic models for investigating the significance of discovered patterns (but unlike the previous case, probability does not model biology) • Approximate matching rather than exact • What is an approximation in this case? Suitable definition needed

NEXT PAPER (IN BRIEF)…

Seminar in Bioinformatics Pairwise Local Alignment of Protein Interaction Networks Guided by Models of Evolution Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Journal of Comp. Biology, 13(2), 182-199, 2006. Presented by: Royi Ronen

The Problem • Protein-Protein-Interaction networks are modeled as graphs • A PPI network is an undirected graph (V,E) • Elements in V represent proteins • Elements in E represent pairs which interact • The paper solves the problem of aligning two graphs (rather than many)

Homology Function S(•,•) • Consider two Graphs: G(U,E), H(V,F) • For each pair from the union of V and U, S assigns a score: • If the pair belongs to the same (a different) species, the confidence that they are paralogous (orthologous). 0 is the lowest value • Values of S are determined by an algorithm out of the scope of the paper (INPARANOID) • Some definitions: • Match: A conserved interaction between orthologous pairs • Mismatch: A lack of interaction between a pair whose orthologs interact • Duplication: Paralogous proteins (tend to diverge in the long run)

Proposed Solution • Every pair of node subsets induces an alignment {M,N,D}, which is associated with a score • M - Pairs of edges, with positive S values between their nodes, which exist in both graphs. Each is associated with a positive score • N - Pairs of edges, with positive S values between their nodes, which exist in one graph but not in the other. Each is associated with a negative score • D - Pairs of nodes from the same graph with positive S. Each is associated with a negative score • The total score is the sum of all the scores, and we wish to find alignments with locally maximal scores
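The scoring of an induced alignment can be sketched as below. The slides do not give the per-item scores, so the flat unit weights (`match_w`, `mismatch_w`, `dup_w`), the function name, the dictionary form of S, and the toy networks are all placeholders chosen for illustration. Node names are assumed to be distinct across the two graphs.

```python
from itertools import combinations, permutations


def alignment_score(g_edges, h_edges, S, P, Q,
                    match_w=1.0, mismatch_w=1.0, dup_w=1.0):
    """Score the alignment induced by node subsets P (of G) and Q (of H).

    S maps frozenset({x, y}) to a similarity score; missing pairs count as 0.
    Edges are frozensets of two node names. All weights are hypothetical.
    """
    def s(a, b):
        return S.get(frozenset((a, b)), 0.0)

    def interacts(edges, a, b):
        return frozenset((a, b)) in edges

    score = 0.0
    # Matches (M) and mismatches (N): node pairs whose endpoints are similar.
    for u, u2 in combinations(sorted(P), 2):
        for v, v2 in permutations(sorted(Q), 2):
            if s(u, v) > 0 and s(u2, v2) > 0:
                in_g = interacts(g_edges, u, u2)
                in_h = interacts(h_edges, v, v2)
                if in_g and in_h:
                    score += match_w       # interaction conserved in both
                elif in_g or in_h:
                    score -= mismatch_w    # interaction present in one only
    # Duplications (D): similar nodes within the same graph.
    for nodes in (P, Q):
        for a, b in combinations(sorted(nodes), 2):
            if s(a, b) > 0:
                score -= dup_w
    return score


# Hypothetical toy networks: one interaction in each species.
g_edges = {frozenset(("h1", "h2"))}
h_edges = {frozenset(("m1", "m2"))}
S = {frozenset(("h1", "m1")): 1.0, frozenset(("h2", "m2")): 1.0}

print(alignment_score(g_edges, h_edges, S, {"h1", "h2"}, {"m1", "m2"}))
```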

Proposed Solution • An algorithm is proposed in order to avoid considering all possible subsets • The heuristic tries to expand a set so that its score increases • Rings a bell?

Experimental Results • Using this alignment method and a scoring algorithm for S(•,•) called INPARANOID, PPI networks of Human and Mouse were aligned • Data taken from the DIP Database • Details: • Homo Sapiens – 1369 interactions between 1065 proteins • Mus Musculus – 286 interactions between 329 proteins

Experimental Results • INPARANOID discovered 237 ortholog clusters • 305 matched interactions were discovered; 205 mismatches, 536 duplications in Human; 149 mismatches, 384 duplications in Mouse. • Examples: • Conserved subnet with one-way mismatches • Conserved subnet with two-way mismatches • Duplications

Example 1 • Graphs aligned • Biological meaning • Similarity and differences between the species • Insight on evolutionary events

Example 2 • Another graph alignment result with local maximum score

Example 3 • Instance of duplication between mouse and human • The regulator regulates homologs
