Prof. Dr. Tim Beissbarth University of Göttingen Christian Bender German Cancer Research Center (DKFZ) Department Molecular Genome Analysis. CAMDA 2008: Extending pathways with inferred regulatory interactions from microarray data and protein domain signatures. DKFZ-B050 INF 580
Prof. Dr. Tim Beissbarth, University of Göttingen
Christian Bender, German Cancer Research Center (DKFZ), Department Molecular Genome Analysis
CAMDA 2008: Extending pathways with inferred regulatory interactions from microarray data and protein domain signatures
DKFZ-B050, INF 580, D-69120 Heidelberg
+49 6221 42 4716
t.beissbarth@dkfz.de, c.bender@dkfz.de
Contents
• Introduction
  • The CAMDA Dataset: Apoptosis time-course of Affara 2007
  • Our goals
• Methods
  • KEGG pathway prediction using InterPro domain signatures
  • Overrepresentation of predicted pathways
  • Dynamical Bayesian Networks
  • Verification of inferred regulations by Ingenuity
• Results
  • Analysis of differential expression and prediction of KEGG pathway memberships with InterPro domain signatures
  • Overrepresentation test of pathways
  • Reconstruction of gene regulatory networks with G1DBN
  • Comparison to Ingenuity based literature network
• Summary/Outlook
1a) Introduction: The CAMDA Dataset
• Questions:
  • What are the kinetics of the transcriptome change during endothelial cell (EC) apoptosis?
  • What is the role of EC apoptosis in blood vessel development?
• Experimental setup, as described in Affara et al. 2007:
  • Expose primary human umbilical vein EC (HUVEC) to partial survival factor deprivation (SFD)
  • 3 replicates
1b) Introduction: Our goals
• Extract differentially expressed genes during the time course
• Predict membership of genes in the corresponding KEGG pathways via InterPro domain signatures and find overrepresented pathways
• Reconstruct gene regulatory networks of the differentially expressed genes in these pathways
• Compare the learned networks to literature knowledge
2a) Motivation of pathway prediction: Functional Characterization of Genes
• Pathway annotation is often not available:
  • Only ~25% of all genes (~4,000 / ~20,000) have an annotation in the KEGG database (Kanehisa et al., 2008)
• Aspects of functional characterization:
  • Molecular function (e.g. kinase)
  • Cellular localization (e.g. nucleus)
  • Biological processes (e.g. apoptosis)
  • Pathways (e.g. MAPK signaling)
  • ...
2a) Idea: predict pathway depending on the domain signature
• Proteins in the same pathway should be somehow similar:
  • Common domains
• Domain information of proteins can be retrieved from the InterPro database
  • Information available for ~90%
• Use InterPro domains to predict pathway membership gene-wise
Gene X -> InterPro domain signature -> Classification model -> KEGG pathway
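The domain signature used as classifier input can be pictured as a binary presence/absence vector over a fixed list of InterPro domains. The following is a minimal sketch of that encoding, not the gene2pathway implementation; the function name `domain_signature` and the domain identifiers are placeholders chosen for illustration and are not meant to refer to specific domains.

```python
def domain_signature(gene_domains, vocabulary):
    """Return a 0/1 vector marking which vocabulary domains the gene carries."""
    present = set(gene_domains)
    return [1 if domain in present else 0 for domain in vocabulary]

# Placeholder domain vocabulary (assumption, for illustration only).
vocabulary = ["IPR000001", "IPR000002", "IPR000003", "IPR000004"]

# Hypothetical gene X carrying the first and third domain.
gene_x_domains = ["IPR000003", "IPR000001"]

x = domain_signature(gene_x_domains, vocabulary)
print(x)  # [1, 0, 1, 0]
```

A vector of this form plays the role of x = 1 0 1 0 1 ... 1 on the classification slides.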
2a) The KEGG Ontology
KEGG
• 01100 Metabolism
• 01200 Genetic Inf. Processing
• 01300 Envir. Inf. Processing
  • 01310 Membrane Transport
  • 01320 Signal Transduction
    • 04010 MAPK pathway
    • ...
  • ...
• 01400 Cellular Processes
Figure: position of Gene X in the KEGG hierarchy.
2a) Hierarchical Classification Model
• The classifier should take the KEGG hierarchy into account
• Predicting something wrong at the top of the hierarchy (e.g. Metabolism vs. Envir. Inf. Proc.) is worse than confusing, e.g., the MAPK with the WNT pathway
• A specific gene can play a role in multiple pathways!
• => Solution: encode this into a specific loss function that compares position vectors over the hierarchy nodes (01100, 01200, 01300, 01400, 01310, ..., 04010):
  y = 0 0 1 0 1 ... 1
  u = 1 0 1 0 1 ... 0
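One simple way to make top-level mistakes cost more than confusions between sibling pathways is to weight each disagreeing hierarchy node by its depth. The sketch below illustrates that idea only; it is not necessarily the loss function used in the talk. The toy hierarchy, the use of ID 04310 as the "WNT" example, the weighting 0.5^depth, and all function names are assumptions.

```python
# Toy excerpt of the hierarchy: node -> parent (None marks a top-level class).
PARENT = {
    "01100": None,     # Metabolism
    "01300": None,     # Envir. Inf. Processing
    "01320": "01300",  # Signal Transduction
    "04010": "01320",  # MAPK pathway
    "04310": "01320",  # assumed ID for the WNT pathway example
}

def depth(node):
    """Number of steps from a node up to its top-level class."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def position_set(pathways):
    """All nodes on the paths from the given pathways to the top level."""
    nodes = set()
    for node in pathways:
        while node is not None:
            nodes.add(node)
            node = PARENT[node]
    return nodes

def hierarchical_loss(true_pathways, predicted_pathways):
    """Sum depth-dependent weights over nodes where the two positions disagree."""
    y = position_set(true_pathways)
    u = position_set(predicted_pathways)
    return sum(0.5 ** depth(node) for node in y ^ u)

print(hierarchical_loss(["04010"], ["04310"]))  # 0.5  (sibling confusion)
print(hierarchical_loss(["04010"], ["01100"]))  # 2.75 (wrong top-level class)
```

Because each argument is a list, a gene that belongs to several pathways can be represented by passing all of them at once.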
2a) Hierarchical Classification Model
Figure: the domain signature x = 1 0 1 0 1 ... 1 is passed down the KEGG hierarchy (01100 Metabolism, 01200 Genetic Inf. Processing, 01300 Envir. Inf. Processing, 01400 Cellular Processes; 01310 Membrane Transport, 01320 Signal Transduction; 04010 MAPK pathway, ...). At each level the question is: "Here or other?"
2b) Overrepresentation test
• Fisher's Exact Test
• Groups:
  • Differential / not differential
  • Predicted as path p / not predicted as path p
• Example:
  • Test for the probability of seeing 22 or more differential genes in path p by chance
• Done for two gene sets:
  • Pathways already annotated in KEGG
  • Pathways already annotated in KEGG or predicted by DS
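The one-sided Fisher's exact test in the example is the upper tail of a hypergeometric distribution. The sketch below computes that tail with the Python standard library. The function name and the pathway size of 100 genes are hypothetical; the totals 7591 and 651 are taken from the workflow slide, and 22 from the example above.

```python
from math import comb

def overrepresentation_pvalue(k, pathway_size, n_differential, n_genes):
    """P(X >= k) when n_differential genes are drawn from n_genes,
    of which pathway_size belong to the pathway (hypergeometric upper tail)."""
    upper = min(pathway_size, n_differential)
    total = comb(n_genes, n_differential)
    tail = sum(
        comb(pathway_size, x) * comb(n_genes - pathway_size, n_differential - x)
        for x in range(k, upper + 1)
    )
    return tail / total

# Hypothetical pathway of 100 genes among 7591 genes, 651 of them differential:
# how likely are 22 or more differential genes in the pathway by chance?
p = overrepresentation_pvalue(k=22, pathway_size=100,
                              n_differential=651, n_genes=7591)
print(p)
```

A pathway would be called overrepresented when this p-value falls below the chosen level (0.05 in the results); applying a multiple-testing correction across pathways is a separate design choice not covered here.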
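2c) Dynamical Bayesian networks (DBN)
Figure: a regulation motif among genes G1, G2, G3 corresponds (≙) to a DBN G over the microarray time-course variables X_{t-1}^1, X_{t-1}^2, X_{t-1}^3; X_t^1, X_t^2, X_t^3; X_{t+1}^1, X_{t+1}^2, X_{t+1}^3.
Lebre et al. 2008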
2d) Finding interactions with Ingenuity
• Given a list of genes, which interactions are already known?
• Example: What is known about Cyclins D and E, CDK 4 and 6, and Rb?
• We use this information to verify the reconstruction quality of our network
Figure: Ingenuity network for the example genes.
3a. Workflow for the analysis
• 18310 informative genes
• Limma analysis (Smyth et al. 2004): 1002 differentially expressed
• Pathway prediction via DS (Fröhlich et al. 2004): 10630 genes to predict
  • 3385 in KEGG, 268 differential
  • 7591 via KEGG and DS, 651 differential
• Fisher's exact test for each pathway, for both gene sets
• Construct regulatory network for genes in significantly overrepresented pathways by DBN
3b) Overrepresentation results
• The pathways Cell Cycle, Metabolism, and Cell Growth and Death are significantly overrepresented within the DS prediction group at the 0.05 level
• Since we were analyzing an apoptosis study, we chose the Cell Cycle pathway for further experiments
3. Selected gene profiles for DBN reconstruction
• Genes predicted in the Cell Cycle pathway
• Only those that were found to be differentially expressed
2c) For verification: Ingenuity network for our selected genes
4. Some details on inferred interactions
• NASP -> PLK1
  • NASP: H1 histone binding protein that is involved in transporting histones into the nucleus of dividing cells
• NASP -> MCM
  • MCM: highly conserved mini-chromosome maintenance proteins that are essential for the initiation of eukaryotic genome replication
• UACA -> BUB1
  • UACA: regulates morphological alterations required for cell growth and motility
  • BUB1: encodes a kinase involved in spindle checkpoint function
5. Summary
• Combine pathway prediction and network inference
  • Find differential genes, predict pathways and find overrepresented pathways
  • Compute regulatory network from the expression data
  • Integrate the inferred interactions into known pathway maps
• Uses two freely available R packages
  • gene2pathway
  • G1DBN
Acknowledgements
• Molecular Genome Analysis (MGA)
  • Annemarie Poustka, department head (died May 2008)
  • Stefan Wiemann, acting head of division MGA
• Biostatistics & Modelling
  • Christian Bender
  • Holger Fröhlich
  • Marc Johannes
3b. Overrepresentation of pathways
Given: list of genes with corresponding pathway annotation or prediction
• Test for overrepresentation of all pathways with Fisher's exact test:
  • once for genes annotated in KEGG
  • once for genes with additional pathway membership prediction by DS
3a. Differential expression and pathway prediction
• Use the R package limma (Smyth et al. 2004) to find differential genes
  • Results in 1002 differentially expressed genes during the time course
• Map all genes on the UniSet array to KEGG pathways via InterPro domain signatures (DS)
  • For 10630 genes an InterPro DS could be found
  • 3385 were already annotated in KEGG, 268 of them differential
  • 4206 genes were assigned to KEGG pathways via DS, 353 of them differential
2b) Approach used in this work
Idea, illustrated with the variables X_11, X_12, X_21, X_22, X_31, X_32:
• Example: test whether X_12 ⊥ X_31 | pa(X_12) => yes
• But: when only first-order dependencies are considered, X_12 ⊥ X_31 | pa^(1)(X_12) means X_12 ⊥ X_31 | X_11 or X_12 ⊥ X_31 | X_21 => no
• So spurious edges can occur in G(1) which disappear in G̃ (the DAG of full-order dependencies)
Lebre et al. 2008
2a) Hierarchical Classification Model
• Input: domain signature x = 1 0 1 0 1 ... 1
• SVMs produce decision values f
• Ranking perceptron (loss function l; Melvin et al., 2007) with a trainable weight vector
• Output: most probable pathways y = 0 0 1 0 1 ... 1, taken from a dictionary of possible position vectors
2a) Hierarchical Classification Model: Training Procedure
• Data set of domain signatures for each gene
• Train SVMs
• Position-labeled data set
• Train ranking perceptron using loss function l
• Result: weight vector z
2b) Dynamical Bayesian networks
Figure: network G over Gene 1, Gene 2, Gene 3, ..., Gene p.
Kim et al. 2003
2b) Dynamical Bayesian networks (Lebre et al. 2008)
• Define G_full as the DAG that contains all edges between successive variables
• Then there exists a minimal DAG describing a BN that contains all conditional dependencies between variables at successive time points
• Any pair of successive variables (X_{t-1}^j, X_t^i) not adjacent in this minimal DAG is conditionally independent given the parents
• The minimal BN description factorizes according to this DAG
• It will be used to define low-order conditional dependence DAGs for the inference of the minimal DAG
2c) Approach used in this work
• Goal: infer DAG G´, containing the full conditional dependencies
• Infer DAG G(1), containing all 1st-order dependencies
• Key observation: G´ ⊆ G(1)
• Derive G´ from G(1)
Figure: DBN over X_{t-1}^1, X_{t-1}^2, X_{t-1}^3; X_t^1, X_t^2, X_t^3; X_{t+1}^1, X_{t+1}^2, X_{t+1}^3.
Lebre et al. 2008
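The first step, screening for 1st-order dependencies, can be illustrated with lagged first-order partial correlations: an edge from gene j at time t-1 to gene i at time t survives only if the association persists when conditioning on each other single gene k at time t-1. This is a simplified stand-in under stated assumptions, not the G1DBN procedure itself (which, as far as we know, scores edges with regression-based p-values); all function names and the simulated data are hypothetical.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))

def first_order_scores(series):
    """Score each candidate edge j(t-1) -> i(t) by the weakest absolute
    partial correlation over all single conditioning genes k != j."""
    p = len(series)
    scores = {}
    for i in range(p):
        target = series[i][1:]          # X_t^i
        for j in range(p):
            source = series[j][:-1]     # X_{t-1}^j
            scores[(j, i)] = min(
                abs(partial_corr(target, source, series[k][:-1]))
                for k in range(p) if k != j
            )
    return scores

# Simulated toy data (assumption): gene 0 drives gene 1 with lag one,
# gene 2 is unrelated noise.
random.seed(0)
n = 200
g0 = [random.gauss(0, 1) for _ in range(n)]
g1 = [random.gauss(0, 1)] + [0.9 * g0[t - 1] + 0.1 * random.gauss(0, 1)
                             for t in range(1, n)]
g2 = [random.gauss(0, 1) for _ in range(n)]

scores = first_order_scores([g0, g1, g2])
edges = sorted(edge for edge, s in scores.items() if s > 0.5)
print(edges)
```

Thresholding the scores yields a candidate graph in the spirit of G(1); a second step would then test the remaining edges against all retained parents to remove spurious ones. Note that real microarray time courses are far shorter than the 200 simulated time points used here.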