Efficient Algorithms for Detecting Signaling Pathways in Protein Interaction Networks
Jacob Scott, Trey Ideker, Richard M. Karp, Roded Sharan
RECOMB 2005

Outline
• Motivation
• Theoretical foundations
• Biological extensions
• Implementation
• Validation techniques
• Results from yeast

Motivation
• In the post-genomic era, we want to understand organisms’ protein-protein interaction networks
• Model the network as a probabilistic graph, with edge weights representing probabilities
• Interested in protein signaling cascades
  • These show up as simple paths in the graph
• Want to find biologically interesting paths efficiently
  • Score paths, with high scores reflecting importance
  • Extended graph algorithms provide speed
• Baseline: automated modelling of signal transduction networks (Steffen et al. 2002)

Theoretical Foundation
• Finding long, simple paths is NP-hard
  • Reduce from TSP
• Once we find these paths, we want the best (lightest) ones
• The requirement that paths be simple is what drives the hardness
• Color-Coding is a randomized, dynamic-programming-based algorithm for finding paths of fixed length
  • Developed by Alon et al. (1995)
  • Randomly color the graph and require paths to be colorful (exactly one vertex of each color)
  • Number of colors = length of paths
  • A colorful path is always simple

Color-Coding
• Colorful paths can be found with dynamic programming
• Key point: a colorful path of length k contains a colorful path of length k-1
• Store path information at each node for each subset of the k colors
• Only $2^k$ color subsets, rather than $O(n^k)$ node subsets
• Runtime is $O(2^k k m) \ll O(k n^k)$ for brute force
• Space is $O(2^k n) \ll O(k n^k)$ for brute force
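
The dynamic program above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the edge dictionary layout, and the convention that a path "of length k" has k vertices and that lower weight is better are all assumptions made for the example.

```python
import random

def best_colorful_path(edges, k, colors):
    """Lightest path on k vertices that uses k distinct colors.

    edges:  dict mapping (u, v) -> weight for undirected edges (assumed layout).
    colors: dict mapping vertex -> color in range(k).
    Returns (weight, path) or None if no colorful path exists.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    # table[(v, mask)] = lightest colorful path ending at v whose color set is `mask`
    table = {(v, 1 << colors[v]): (0.0, (v,)) for v in adj}
    for _ in range(k - 1):
        extended = {}
        for (u, mask), (w, path) in table.items():
            for v, w_uv in adj[u]:
                bit = 1 << colors[v]
                if mask & bit:  # color already used, so the path would not be colorful
                    continue
                key = (v, mask | bit)
                cand = (w + w_uv, path + (v,))
                if key not in extended or cand < extended[key]:
                    extended[key] = cand
        table = extended
    return min(table.values(), default=None)

def lightest_path(edges, k, trials, seed=0):
    """Repeat random colorings and keep the best colorful path seen."""
    rng = random.Random(seed)
    vertices = sorted({v for e in edges for v in e})
    best = None
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in vertices}
        found = best_colorful_path(edges, k, colors)
        if found is not None and (best is None or found < best):
            best = found
    return best
```

The table is indexed by (vertex, color subset) rather than by vertex subset, which is where the $2^k$ factor in the bounds above comes from.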

Coloring Example
[Figure: toy graph on vertices A to H, shown under two colorings, I and II]
• Two different colorings on a toy graph, k=3
• In coloring I, W(A,RGB) is built C->BC->ABC
• In coloring II, W(A,RGB) is built G->BG->ABG
• ABC is not colorful in coloring II

Monte Carlo Details
• A colorful path is simple, but a simple path may not be colorful under a given coloring
• Solution: run multiple independent trials
• After one trial, for paths of length k, [equation not recovered from the slide]
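
The slide's own formula did not survive, so the following is not a reconstruction of it. It is a sketch of the standard color-coding argument: a fixed simple path on k vertices receives k distinct colors with probability k!/k^k (which exceeds e^-k), and independent trials drive the failure probability down geometrically. The function names are hypothetical.

```python
import math

def colorful_probability(k):
    """Chance that a fixed simple path on k vertices gets k distinct colors."""
    return math.factorial(k) / k**k

def trials_needed(k, success):
    """Smallest number of independent colorings after which the path has been
    colorful at least once with probability >= success."""
    p = colorful_probability(k)
    return math.ceil(math.log(1.0 - success) / math.log(1.0 - p))

if __name__ == "__main__":
    for k in (8, 9, 10):
        print(k, colorful_probability(k), trials_needed(k, 0.999))
```

The trial count grows roughly like e^k, which is consistent with runtime rising quickly as the path length increases.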

Adding Biology
• Color-Coding gives an algorithmic basis; now introduce biologically motivated extensions
• Can set the start or end of a path by type
  • E.g. screening by Gene Ontology categories
• Can force the inclusion of a protein on the path by giving it a unique color
• Using counters, can specify “path must contain between x and y proteins of a given type”
  • Computational cost is multiplicative in y per counter

Adding Biology - Segmented Paths
• Pathways may be ordered
  • E.g. signaling pathways going from membrane proteins to nuclear proteins and finally to transcription factors
• Assign each protein an integer label based on biological information, and build the path out of ordered sequences of labeled proteins
• Now only need to constrain color collisions among proteins with the same label
• If the path length is about equally split among labels, the probability of a correct coloring rises
• Modifications handle proteins that cannot be assigned a unique label
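
One way to read the last two bullets, offered as an assumption rather than as the authors' exact scheme: if each label gets its own palette sized to its segment, a path is correctly colored when every segment is colorful on its own, so the success probability is a product of per-segment terms. The function names below are hypothetical.

```python
import math

def colorful_probability(k):
    """Chance that k fixed vertices receive k distinct colors out of k."""
    return math.factorial(k) / k**k

def segmented_probability(segment_sizes):
    """Chance that every segment is colorful when each label has its own palette."""
    p = 1.0
    for size in segment_sizes:
        p *= colorful_probability(size)
    return p

if __name__ == "__main__":
    # A path on 8 vertices: one segment versus two equal segments of 4.
    print(colorful_probability(8), segmented_probability([4, 4]))
```

Under this reading, splitting 8 vertices into two segments of 4 raises the per-trial success probability, so fewer trials are needed for the same confidence.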

Adding Biology - More Structures
• Modifications to the Color-Coding recurrence allow the discovery of structures beyond simple paths
• Example: two-terminal series-parallel graphs
  • Capture parallel signaling pathways
[Figure: example two-terminal series-parallel graph]

Generating Edge Weights
• So far, we have glossed over how weights (probabilities) on the protein graph are assigned
• Here, following our previous work, the weight is a logistic function of three variables (for a pair of proteins):
  • Number of times an interaction between them was experimentally observed
  • Pearson correlation coefficient of the expression of the corresponding genes
  • Their small-world clustering coefficient
• Used training data from MIPS (gold standard) to fit the relative weighting
• Taking the log of the weights makes the path score additive
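
A minimal sketch of the weighting scheme just described. The coefficients are placeholders, not the values fitted on MIPS, and the function names and the choice of negative log so that lighter paths are more reliable are assumptions made for illustration.

```python
import math

def edge_reliability(observations, expression_corr, clustering, coeffs):
    """Logistic function of the three features.

    coeffs = (intercept, b_observations, b_correlation, b_clustering);
    these are hypothetical values, to be fitted on a gold-standard set.
    """
    b0, b1, b2, b3 = coeffs
    z = b0 + b1 * observations + b2 * expression_corr + b3 * clustering
    return 1.0 / (1.0 + math.exp(-z))

def path_weight(probabilities):
    """Sum of -log p over the edges of a path.

    Minimizing this sum is the same as maximizing the product of the
    edge probabilities, which is what makes the score additive.
    """
    return sum(-math.log(p) for p in probabilities)

if __name__ == "__main__":
    coeffs = (-2.0, 1.0, 1.5, 0.5)  # placeholder coefficients
    p1 = edge_reliability(2, 0.6, 0.3, coeffs)
    p2 = edge_reliability(1, 0.1, 0.0, coeffs)
    print(p1, p2, path_weight([p1, p2]))
```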

Application
• Tested our simple-path implementation on the yeast interaction network
  • ~4,500 vertices, ~14,500 edges
  • Based on interaction data from the Database of Interacting Proteins (Feb 2004)
• Runtimes varied from minutes (length 8) to under two hours (length 10)
• Much faster than brute force for longer paths (14x for paths of length 9)
• Focus on paths from membrane proteins to transcription factors

Validation Techniques
• Three methods of validation
• Two statistical:
  • Functional enrichment p-value, based on how many proteins in the path are similar (by GO category)
  • Weight p-value, which compares the weights of paths to those found when the protein graph undergoes random degree-preserving shuffling
• Lastly, search for expected pathways
  • MAP-Kinase, ubiquitin-ligation
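
The weight p-value can be illustrated with a double-edge-swap shuffle, a common way to randomize a graph while preserving every vertex degree. This sketch is an assumption about how such a test might look, not the authors' procedure: the function names are hypothetical, edge weights are left out of the shuffle, and the p-value is taken as the plain fraction of random-network path weights that are at least as light as the observed one.

```python
import random

def degree_preserving_shuffle(edges, swaps, rng):
    """Randomize an undirected simple graph by double-edge swaps.

    Each accepted swap turns (a,b),(c,d) into (a,d),(c,b), so every
    vertex keeps its degree. Swaps that would create a self-loop or a
    duplicate edge are skipped.
    """
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list

def empirical_p_value(observed, random_weights):
    """Fraction of random-network path weights at least as light as `observed`."""
    hits = sum(1 for w in random_weights if w <= observed)
    return hits / len(random_weights)
```

In use, one would shuffle the network many times, rerun the path search on each shuffled copy, and pass the resulting best weights to `empirical_p_value`.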

MAP-Kinase and Ubiquitin-Ligation
• Concentrated on three MAPK pathways (same as Steffen et al.)
  • Pheromone response
  • Filamentous growth
  • Cell wall integrity
• Looked for shorter (length 4-6) ubiquitin-ligation pathways
  • Started at a cullin, ended at an F-box protein
  • High functional enrichment under the ubiquitin GO category

Statistical Results (CDFs)
• 100 best paths of length 8, at 99.9% success probability
• 100 normal and 2000 random paths used for the weight p-value

MAPK Recovery Results
[Figure, four panels]
A) Cell wall integrity pathway in yeast: MID2, RHO1, PKC1, BCK1, MKK1/2, SLT2, RLM1
B) Best path of length 7 found from MID2 to RLM1: MID2, ROM2, RHO1, PKC1, MKK1, SLT2, RLM1
C) Pheromone response signaling pathway in yeast: STE2/3, STE4/18, CDC42, STE20, STE11, STE7, FUS3, DIG1/2, STE12
D) Best path of length 9 found from STE2/3 to STE12: STE3, AKR1, STE4, CDC24, BEM1, STE5, STE7, KSS1, STE12

Additional MAPK Recovery Results
[Figure, two panels]
Pheromone response signaling pathway in yeast: STE2/3, STE4/18, CDC42, STE20, STE11, STE7, FUS3, DIG1/2, STE12
Pheromone response pathway assembly network, with nodes: REM1, STE50, FAR1, GPA1, CDC24, STE3, STE4/18, STE12, FUS3, STE7, DIG1/2, CDC42, STE11, AKR1, KSS1, STE5

Conclusion
• Presented efficient, color-coding-based algorithms for finding simple paths
• Added biological extensions and other structures
• Integrated our well-founded reliability scores
• Applied our algorithms to yeast
  • Showed that 60% of discovered pathways were significantly enriched
  • Recovered known MAP-Kinase and ubiquitin-ligation pathways

Simple vs. Segmented CDFs
[Figure: CDFs of the functional enrichment p-value for simple and segmented paths; annotated values: Segmented 72%, Simple 54%]

References
• Steffen, M., Petti, A., Aach, J., D’haeseleer, P., Church, G.: Automated modelling of signal transduction networks. BMC Bioinformatics 3 (2002) 34–44
• Alon, N., Yuster, R., Zwick, U.: Color-coding. J. ACM 42 (1995) 844–856