Network Querying Algorithms

Roded Sharan, Tel-Aviv University

Protein Interactions • Crucial to cell function. • Measured by high-throughput technologies: • yeast two-hybrid • co-immunoprecipitation • Systematic data available for several species.

Network Querying Problem • Sequence comparison allows transferring information from a well-studied genome to another genome. • Species A: well studied; protein interaction subnetworks defined by extensive experimentation. • Species B: less studied; little knowledge of subnetworks; protein interaction network mapped using high-throughput technologies. • Can we use the knowledge of A to discover corresponding subnetworks in B (if such exist)?

Isomorphic Alignment • Match of homologous proteins. [Figure: query Q in species A aligned to an isomorphic subnetwork in species B; every node pair is labeled "match".]

Homeomorphic Alignment • Match of homologous proteins and deletion/insertion of degree-2 nodes. [Figure: query Q in species A aligned to a homeomorphic subnetwork in species B; node pairs are labeled "match", with one deletion and one insertion.]

Score of Alignment • Score = sequence similarity score for matches + interaction reliability scores + penalty for deletions & insertions. [Figure: example alignment annotated with match scores h(q1,v1) to h(q6,v6), an interaction reliability w(v1,v2), a deletion penalty and an insertion penalty.]
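A minimal sketch of the additive score for a path alignment. The function name, the argument layout, and the constant penalty values are assumptions made for illustration; only the three summed terms (h for matches, w for interactions, penalties for indels) come from the slide.

```python
def alignment_score(path, matches, n_deletions, h, w,
                    ins_penalty=-1.0, del_penalty=-1.0):
    """Additive score of a path alignment (illustrative only).

    path:        network vertices of the matched path, in order
                 (including inserted vertices).
    matches:     dict mapping query node -> matched network vertex.
    n_deletions: number of query nodes left unmatched.
    h:           dict (query node, vertex) -> sequence similarity score.
    w:           dict frozenset({u, v}) -> interaction reliability score.
    """
    # Sequence similarity of every matched pair.
    similarity = sum(h[(q, v)] for q, v in matches.items())
    # Reliability of every interaction along the network path.
    reliability = sum(w[frozenset(e)] for e in zip(path, path[1:]))
    # Path vertices that are not matched to a query node are insertions.
    n_insertions = len(set(path) - set(matches.values()))
    penalties = n_insertions * ins_penalty + n_deletions * del_penalty
    return similarity + reliability + penalties


# Hypothetical toy data: q1 -> v1, q2 -> v3, with v2 inserted between them.
h = {("q1", "v1"): 2.0, ("q2", "v3"): 1.5}
w = {frozenset(("v1", "v2")): 0.5, frozenset(("v2", "v3")): 0.25}
score = alignment_score(["v1", "v2", "v3"], {"q1": "v1", "q2": "v3"}, 0, h, w)
```

Here the score is the two similarity terms plus the two reliability terms plus one insertion penalty.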

Network Querying Problem • Given a query graph Q and a network G, find the sub-network of G that is: • homeomorphic to Q • aligned with maximal score

Complexity • The network querying problem is NP-complete, by reduction from subgraph isomorphism. • Naïve algorithm has O(n^k) complexity, where n = size of the PPI network and k = size of the query. • Intractable for realistic values of n and k (n ~5000, k ~10). • Reduction in complexity can be achieved by: • Constraining the network [Pinter et al., Bioinformatics’05] • Constraining the query (fixed parameter algs.) • Allowing vertex repetitions
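To make the gap concrete, the arithmetic for the quoted values can be checked directly. This is only a rough illustration: the 2^k figure ignores the hidden constant in a 2^O(k) bound and the polynomial factor in the network size.

```python
n, k = 5000, 10

# Naive search: on the order of n^k candidate placements of the query.
naive_candidates = n ** k

# Exponential factor of a fixed-parameter bound, hidden constants ignored.
fpt_factor = 2 ** k

print(f"n^k = {naive_candidates:.3e}")
print(f"2^k = {fpt_factor}")
```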

Path Querying

The Path Query Problem [Figure: a query pathway aligned to a target pathway; A, C and D are paired with A’, C’ and D’, B is marked as a deletion and E as an insertion.]

PathBLAST • p(v) – sequence similarity • q(e) – interaction reliability. Kelley et al., PNAS’03

Alignment-Based Approach Pros: • Conceptually simple. • Extensible to general queries (using any network alignment program). Cons: • No general treatment of indels. • Protein Repetitions.

DP-Based Approach • Use dynamic programming (à la sequence alignment): W(i,j) is the maximal score of a partial alignment of query nodes {1…i} that ends at vertex j of the network. • The recurrence takes the best of three cases: match, insertion, deletion. • But this may introduce protein repetitions along the path. Shlomi et al., BMC Bioinformatics ’06; Yang & Sze, JCB’07
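One plausible way to write the recurrence behind W(i,j), reusing h (sequence similarity) and w (interaction reliability) from the scoring slide. The insertion budget θ and the penalty symbols δ_ins and δ_del are assumptions added here so that the insertion case is well founded; they do not appear on the slide.

```latex
W(i,j,\theta) = \max
\begin{cases}
\max\limits_{m:\,(m,j)\in E} \; W(i-1,m,\theta) + w(m,j) + h(q_i,j) & \text{(match)}\\[4pt]
\max\limits_{m:\,(m,j)\in E} \; W(i,m,\theta-1) + w(m,j) + \delta_{\mathrm{ins}} & \text{(insertion)}\\[4pt]
W(i-1,j,\theta) + \delta_{\mathrm{del}} & \text{(deletion)}
\end{cases}
```

Nothing in this recurrence records which vertices were already used, which is why the resulting path may repeat proteins.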

Color Coding [AYZ’95] • Problem: Given a graph G=(V,E) and a parameter k, find a simple path of length k in G. • Algorithm: Randomly color vertices with k colors, and find a colorful path (distinct colors). • Complexity: • Colorful path found by DP in O(k·m·2^k). • Prob. of success (path is colorful): k!/k^k > e^(-k). • Overall: m·2^O(k).
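A minimal sketch of the technique, assuming an adjacency-list graph and reading "length k" as a path on k vertices. The function names are hypothetical and the code only decides existence; it does not score or return the path.

```python
import random


def has_colorful_path(adj, coloring, k):
    """True if some path on k vertices uses k distinct colors.

    adj:      dict vertex -> iterable of neighbours.
    coloring: dict vertex -> color in range(k).
    DP state: for each vertex, the set of color bitmasks of colorful
    paths ending there. Distinct colors imply distinct vertices.
    """
    reachable = {v: {1 << coloring[v]} for v in adj}
    for _ in range(k - 1):
        extended = {v: set() for v in adj}
        for u in adj:
            for mask in reachable[u]:
                for v in adj[u]:
                    bit = 1 << coloring[v]
                    if not mask & bit:  # color of v not used yet
                        extended[v].add(mask | bit)
        reachable = extended
    return any(reachable[v] for v in adj)


def find_k_path(adj, k, trials, seed=0):
    """Repeat random colorings; True is always correct, False may be a miss."""
    rng = random.Random(seed)
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in adj}
        if has_colorful_path(adj, coloring, k):
            return True
    return False


# Toy graphs: a path 0-1-2-3 and a star with centre 0.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

The error is one-sided: a colorful path is always a genuine simple path, so only negative answers depend on the number of trials.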

Network Querying with Color Coding • Randomly color the network graph. • Run the DP algorithm on the query to find a high-scoring subnetwork. • Repeat N times. Shlomi et al., BMC Bioinformatics ’06
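A rough way to choose N, assuming independent colorings and a target failure probability eps (both the function names and eps are assumptions for illustration). A fixed k-vertex subnetwork is colorful with probability P = k!/k^k, and since (1 - P)^N <= exp(-P*N), taking N >= ln(1/eps)/P keeps the chance of missing it below eps.

```python
import math


def colorful_probability(k):
    """Probability that k fixed vertices receive k distinct colors."""
    return math.factorial(k) / k ** k


def trials_needed(k, eps):
    """Smallest N with exp(-P * N) <= eps, where P = k!/k^k."""
    return math.ceil(math.log(1.0 / eps) / colorful_probability(k))


for k in (5, 7, 9):
    print(k, colorful_probability(k), trials_needed(k, 0.01))
```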

Yeast & Fly PPI Networks • S. cerevisiae • 4,726 proteins • 15,166 interactions • D. melanogaster • 7,028 proteins • 22,837 interactions

Yeast-Fly Queries • Applied QPath to 271 yeast queries spanning the yeast network. • 63% of queries were matched, most requiring protein indels.

The Scoring Module • Functional enrichment of a matched path correlates with: • Its interaction reliabilities • Its sequence similarities • Its numbers of protein insertions and deletions (anti-correlation). Goal: score matched pathways by their prob. to be functionally enriched. Method: logistic regression on path attributes – PPI reliabilities, sequence similarities, #insertions, #deletions.
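A minimal sketch of how a logistic model turns the four path attributes into a probability. The weights and bias below are made-up placeholders, not fitted values from the study; only the choice of attributes and the negative direction for indels follow the slide.

```python
import math


def enrichment_probability(features, weights, bias):
    """Logistic model: P(enriched) = 1 / (1 + exp(-(bias + w . x)))."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; indels get negative weights (anti-correlation).
weights = {"reliability": 1.0, "similarity": 1.0,
           "insertions": -0.5, "deletions": -0.5}

clean = {"reliability": 0.8, "similarity": 0.9, "insertions": 0, "deletions": 0}
gappy = {"reliability": 0.8, "similarity": 0.9, "insertions": 2, "deletions": 1}

p_clean = enrichment_probability(clean, weights, bias=0.0)
p_gappy = enrichment_probability(gappy, weights, bias=0.0)
```

With these placeholder weights the path with indels scores lower than the otherwise identical path without them.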

Best Matches • 171 best matches identified. • 51% were functionally enriched. • Best matches were significantly more functionally enriched and expression coherent than arbitrary pathways (p<1e-4).

Queries with Known Pathways [Figure panels: MAP kinase (yeast); ubiquitin ligation; Hedgehog.]

Function Conservation • 69 best matches had an enriched function in both species. • 64% preserved their function; significantly more than the random expectation (31%). • In comparison, sequence best matches preserve their function in only 40% of the cases! • Pathway homology can be used to predict function!

Fly Conserved Pathway Map • Predicted annotations were significantly prevalent. • Map exhibits modularity (cc=0.26).

Querying for Trees & General Graphs

QNet: Tree Queries • Query has k nodes. • Randomly color the network with k distinct colors. • Suppose optimal subnetwork is “colorful” (all of its vertices colored with distinct colors). • Use the colors to remember the visited nodes. Dost et al., RECOMB’07

Finding colorful trees [Figure: a query tree with nodes q1 to q7 next to a colored network with vertices v1, v2, v3, v4, v6 and v7.]

Querying General Graphs • The algorithm also extends to general query graphs. • Idea: • Map the original graph into a tree, using a tree decomposition. • Solve the querying problem on this tree using DP.

Querying General Graphs • Map the original query into a tree using tree decomposition. • Each node of the tree T is a set of vertices of the graph G. [Figure: graph G with vertices u, v, z and its tree decomposition T.]

Querying General Graphs • Width(T) = size of its largest node – 1. • Tree-width(G) = minimum width among all possible tree decompositions of G.
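A small sketch of these definitions, assuming a decomposition is given as a list of bags (sets of graph vertices) plus edges between bag indices. The function names are hypothetical, and the check assumes `tree_edges` already forms a tree; it verifies only the three standard conditions on the bags.

```python
def width(bags):
    """Width of a decomposition: size of its largest bag minus one."""
    return max(len(bag) for bag in bags) - 1


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check vertex coverage, edge coverage, and subtree connectivity."""
    # 1. Every graph vertex appears in some bag.
    if not all(any(v in bag for bag in bags) for v in vertices):
        return False
    # 2. Both endpoints of every graph edge share some bag.
    if not all(any(u in bag and v in bag for bag in bags) for u, v in edges):
        return False
    # 3. The bags containing a given vertex form a connected subtree.
    neighbours = {i: set() for i in range(len(bags))}
    for i, j in tree_edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    for v in vertices:
        holding = {i for i, bag in enumerate(bags) if v in bag}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in neighbours[i] & holding:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != holding:
            return False
    return True


# A 4-cycle a-b-c-d-a split into two triangles sharing the pair {a, c}.
cycle_vertices = ["a", "b", "c", "d"]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
bags = [{"a", "b", "c"}, {"a", "c", "d"}]
```

For the 4-cycle, the two-bag decomposition is valid and has width 2. Computing the tree-width itself means minimizing the width over all decompositions, which this sketch does not attempt.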

Querying General Graphs • Original query has k nodes and tree-width t. • Randomly color the network with k distinct colors. [Figure: tree decomposition T of a query with nodes q1 to q8, matched to network vertices v1 to v8 and annotated O(n^(t+1)).]

Running time • n = size of network, k = size of query. • Tree queries: • m·2^O(k). • Tractable for realistic values of m and k. • E.g.: n ~5000, k=9 => 11 seconds • Bounded-tree-width graphs: • t = tree-width • n^(t+1)·2^O(k)

A Tree-Based Heuristic • Extract several spanning trees from the original query. • Query each spanning tree in the network. • Merge the matching trees to obtain a matching graph.
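The first and last steps can be sketched as follows. The names are hypothetical, the spanning trees come from a randomized Kruskal pass (not a uniform sample over all spanning trees), and the merge is a plain union of matched edges. The slides do not specify either choice, and the tree-querying step in between is left out.

```python
import random


def random_spanning_tree(vertices, edges, rng):
    """Randomized Kruskal: shuffle the edges, keep those joining components."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    shuffled = list(edges)
    rng.shuffle(shuffled)
    tree = []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def merge_matches(matched_trees):
    """Union of the undirected edges of all matched trees."""
    merged = set()
    for tree in matched_trees:
        merged.update(frozenset(edge) for edge in tree)
    return merged


# Toy query graph: a 4-cycle with one chord.
query_vertices = ["a", "b", "c", "d"]
query_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
rng = random.Random(1)
trees = [random_spanning_tree(query_vertices, query_edges, rng) for _ in range(3)]
```

Each extracted tree has one edge fewer than the query has nodes, so every one of them can be handed to a tree-query routine.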

Test 1: Importance of Topology • Motivation: Is sequence similarity enough to find a corresponding sub-network? • Queries: random tree queries from the yeast DIP network [Salwinski, 2004], with topology perturbed (≤2 ins-dels). • Network: yeast PPI, with protein sequences mutated (50-70 percent). • How distant is the result from the original extracted tree?

Test 1: Importance of Topology [Figure: two panels, BLAST and QNet, each plotting average distance against #ins+#del.] • Distance = #missing proteins + #extra proteins • QNet outperforms sequence-based searches.
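The distance measure is the size of the symmetric difference between the two protein sets. A minimal sketch, with a hypothetical function name and made-up protein labels:

```python
def match_distance(original, result):
    """Distance = #missing proteins + #extra proteins."""
    original, result = set(original), set(result)
    missing = original - result  # in the extracted tree, absent from the match
    extra = result - original    # in the match, absent from the extracted tree
    return len(missing) + len(extra)


# One protein missing (A) and two extra (D, E).
distance = match_distance({"A", "B", "C"}, {"B", "C", "D", "E"})
```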

Test 2: Cross-species Comparison of MAPK Pathways • Motivation: finding conserved pathways. • Query: human MAPK pathway involved in cell proliferation and differentiation. • Network: fly PPI network (~7K proteins, ~20K interactions). • Match: a known fly MAPK pathway involved in dorsal pattern formation. [Figure: the query from human next to its match in fly.]

Test 3: Cross-species Comparison of Protein Complexes • Motivation: conserved protein complexes between yeast and fly. • Queries: • Hand-curated yeast MIPS complexes. • Project onto yeast DIP network. • Extract several spanning trees. • Network: • Fly DIP network • Match • Consensus matching graph for each query complex.

Test 3: Cross-species Comparison of Protein Complexes [Figure: the yeast Cdc28p complex next to its match in fly.] Result: • ~40 of the queries resulted in a match with >1 protein. • 72% of the consensus matches were functionally enriched. • In comparison, 17% of the random trees extracted from the network are functionally enriched.

Conclusions • Fixed parameter algorithms for querying paths and trees. • Definition of a match: homeomorphism • General queries: • Yang & Sze JCB’07: branch-and-bound • Alignment-based

Acknowledgments • Danny Segal, TAU • Trey Ideker, UCSD • Richard Karp, ICSI • Eytan Ruppin • Tomer Shlomi • Vineet Bafna, UCSD • Banu Dost • Nitin Gupta
