370 likes | 527 Views
Joint analysis of regulatory networks and expression profiles. Ron Shamir School of Computer Science Tel Aviv University April 2013. Sources: Igor Ulitsky and Ron Shamir. Identification of Functional Modules using Network Topology and High-Throughput Data. BMC Systems Biology 1:8 (2007).
E N D
Joint analysis of regulatory networks and expression profiles Ron Shamir School of Computer Science Tel Aviv University April 2013 Sources: Igor Ulitsky and Ron Shamir. Identification of Functional Modules using Network Topology and High-Throughput Data. BMC Systems Biology 1:8 (2007). Igor Ulitsky and Ron Shamir. Identifying functional modules using expression profiles and confidence-scored protein interactions. Bioinformatics Vol. 25 no. 9 1158-1164 (2009) . 1
Outline Background Joint network and expression profiles Matisse Cezanne
DNA protein RNA One program Its output The hard disk transcription translation
DNA Microarrays / RNA-seq Simultaneous measurement of expression levels of all genes / transcripts. Perform 105-109 measurements in one experiment Allow global view of cellular processes. The most important biotechnological breakthroughs of the last /current decade http://www.biomedcentral.com/1471-2105/12/323/figure/F2
The Raw Data genes experiments • Entries of the Raw Data matrix: expression levels. • Ratios/absolute values/… • expression patternfor each gene • Profile for eachexperiment • /condition/sample/chip • Needs normalization!
A. Maron, R. Sharan Bioinformatics 03 A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon BMC Bioinformatics 05 EXPression ANalyzer and DisplayER Ulitsky et al. Nature Protocols 10 Clustering Identify clusters of co-expressed genes CLICK, KMeans, SOM, hierarchical Promoter analysis Analyze TF binding sites of co-regulated genes PRIMA Biclustering Identify homogeneous submatrices SAMBA Visualization Function. enrichment GO, TANGO microRNA function inference: FAME • http://acgt.cs.tau.ac.il/expander
Networks of Protein-protein interactions (PPIs) Large, readily available resource Representation: Network with nodes=proteins/genes edges=interactions • Analysis methods: • Global properties • Motif content analysis • Complex extraction • Cross-species comparison
Potential inroad into pathways and function Can the network help to improve the analysis?
Goal • Challenge: Detect active functional modules: connected subnetwork of proteins whose genes are co-expressed • “Where is the action in the network in a particular experiment?”
13 Ron Shamir, RNA Antalia, April 08
Ulitsky & Shamir BMC Systems Biology 07
Interaction High expression similarity Modular Analysis for Topology of Interactions and Similarity SEts • Input: Expression data and a PPI network • Output: a collection of modules • Connected PPI subnetworks • Correlated expression profiles http://acgt.cs.tau.ac.il/matisse
Probabilistic model • Event Mij: i,j are mates = highly co-expressed • P(Sij|Mij) ~ N(m , 2m) • P(Sij|Mij) ~ N(n , 2n) • H0: U is a set of unrelated genes • H1: U is a module = connected subnetwork with high internal similarity • Ri: gene i transcriptionally regulated • m: fraction of mates out of module gene pairs that are transcriptionally regulated • m= P(Mij| Ri Rj, H1) • pm: fraction of mates out of all gene pairs that are transcriptionally regulated
Probabilistic model (2) • Is connected gene set U a module? Assuming pair indep: • Define mij= m P(Ri)P(Rj) • Define nij= pm P(Ri)P(Rj). • Likelihood ratio Pr(Data|H1)/Pr Data|H0) • Taking log: sum of terms ij:
Probabilistic model - summary • Similarities:mixture of two Gaussians • For a candidate group U, the likelihood ratio of originating from a module or from the background is • Module score = Gene group likelihood ratio = sum over all the gene pairs • Find connected subgraphs U with high WU
Complexity Finding heaviest connected subgraph: NP hard even without connectivity constraints (+/- edge weights) Devised a heuristic for the problem
MATISSE workflow • Seed generation • Greedy optimization • Significance filtering
Finding seeds • Three seeding alternatives tested • All alternatives build a seed and delete it from the network • Building small seeds around single nodes: • Best neighbors • All neighbors • Approximating the heaviest subgraph • Delete low-degree nodes and record the heaviest subnetwork found
Greedy optimization • Simultaneous optimization of all the seeds • The following steps are considered: • Node addition • Node removal • Assignment change • Module merge
Front vs. Back nodes • Only a fraction of the genes (front nodes) have meaningful similarity values • MATISSE can link them using other genes (backnodes). • Back nodes correspond to: • Unmeasured transcripts • Post-translational regulation • Partially regulated pathways
Test case: Yeast osmotic shock Network: 65,990 PPIs & protein-DNA interactions among 6,246 genes Expression: 133 experimental conditions – response of perturbed strains to osmotic shock (O’Rourke & Herskowitz 04) Front nodes: 2,000 genes with the highest variance
Pheromone response subnetwork Back Front
Performance comparison % of modules with category enrichment at p< 10-3 % annotations enriched at p<10-3 in modules
F. Müller, L. Laurent, D. Kostka, I. Ulitsky, R. Williams, C. Lu, I. Park, M. Rao, P. Schwartz, N. Schmidt, J. LoringNature 08 ~150 human stem cell lines of diverse types profiled using microarrays Clustered profiles into groups Adjusted Matisse to seek subnetworks that characteristic to each group Focused analysis on pluripotent stem cells Application to stem cells
Pluripotent stem cells network Highlights the key protein machinery underlying pluripotency
Ulitsky & Shamir Bioinformatics 2009
Accounting for PPI confidence PPI-based analysis is made difficult by abundant false positive / negative interactions Various methods can assign confidence (probability) to individual edges Idea: seek modules that are connected with high probability Ulitsky & Shamir Bioinformatics, 2009
CEZANNE: (Co-Expression Zone ANalysis using NEtworks) Edge probability p(e)Edge weight –log(1-p(e)) For any WU, ≥1 edge connects W with U\Wwith probability q (e.g. 0.95) The weight of the minimum cut of U is at least -log(1-q) Algorithm: among the subnets whose minimum cut exceeds -log(1-q) find the one with the maximum co-expression score P({A},{B,C,D})=1-0.3*0.3=0.91 A P({A,B,D},{C})=0.994 0.7 C minimum cut 0.7 0.8 0.9 D B P({A,B},{C,D})=0.94 P({A,C,D},{B})=0.94
DNA damage response in S. cerevisiae 47 DNA Damage Response expression profiles(Gasch et al., 01) Front nodes: 2,074 genes with at least two-fold expression change Network and confidence values: purification enrichment (PE) scores (Collins et al. 07)
DNA damage response modules Cytoplasmic ribosome biogenesis Suggests SWS2 a novel member Novel pathway enriched with actin-localized proteins; Supported in other datasets; Similar deletion phenotypes Proteasome Mitochondrial ribosome – small subunit Spliceosome Novel actin-localized pathway? Mitochondrial ribosome – large subunit Ribonuclease P Trehalose biosynthesis PKA Hsp90
Comparison with prior work Combined measure of sensitivity (% of annotations enriched) and specificity (% of modules enriched) with p<0.001 Clustering expression & network (Hanisch et al., 2002) Expression similarity + confident network connectivity Expression similarity + network connectivity Clustering of only expression data
Summary • Algorithms using co-expression + networks to detect functionally coherent modules • Accommodate both sparse and dense subnetworks • Subnetworks linked to osmotic shock and DNA damage • A general framework for confident connectivity in PPI networks • The next steps: • Co-expression is not the only interesting way to utilize GE data • Scaling to complex human datasets