APO-SYS workshop on data analysis and pathway charting • Igor Ulitsky • Ron Shamir’s Computational Genomics Group
Part I: Presentations • EXPANDER • AMADEUS • SPIKE • MATISSE
Part II: Hands-on Session • EXPANDER • MATISSE • SPIKE
EXPression ANalyzer and DisplayER (EXPANDER) • Adi Maron-Katz, Chaim Linhart, Amos Tanay, Rani Elkon, Israel Steinfeld, Seagull Shavit, Igor Ulitsky, Roded Sharan, Yossi Shiloh, Ron Shamir • http://acgt.cs.tau.ac.il/expander
EXPANDER • Low-level analysis: • Missing data estimation (KNN or manual) • Normalization: quantile, lowess • Filtering: fold change, variation, t-test • Standardization: mean 0 std 1, take log, fixed norm • High-level gene partition analysis: • Clustering • Biclustering • Ascribing biological meaning to patterns: • Enriched functional categories (Gene Ontology) • Identifying transcriptional regulators – promoter analysis • Built-in support for 9 organisms: • human, mouse, rat, chicken, zebrafish, fly, worm, Arabidopsis, yeast
[Diagram: EXPANDER analysis flow] • Input data • Normalization / Filtering • Clustering (CLICK, SOM, K-means, Hierarchical) • Biclustering (SAMBA) • Functional enrichment (TANGO) • Promoter signals (PRIMA) • Links to public annotation databases • Visualization utilities
EXPANDER - Preprocessing • Input data: • Expression matrix (probe-row; condition-column) • One-channel data (e.g., Affymetrix) • Dual-channel data (cDNA microarrays, data are (log) ratios between the Red and Green channels) • ‘.cel’ files • ID conversion file: map probes to genes • Gene sets data • Data definitions: • Defining condition subsets • Data type & scale (log)
EXPANDER – Preprocessing (II) • Data adjustments: • Missing value estimation (KNN or arbitrary) • Merging conditions • Normalization: removal of systematic biases from the analyzed chips • Implemented methods: quantile, lowess • Visualization: box plots, scatter plots (simple, M vs. A)
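To illustrate the quantile method named on this slide, here is a minimal Python sketch. It is not EXPANDER's code: the function name `quantile_normalize`, the toy matrix, and the tie handling (tied values receive distinct ranks) are assumptions made for this example, and missing values are assumed to have been filled in beforehand.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (conditions) of a probe x condition matrix.

    Assumption: no missing values; tied values get distinct ranks.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Reference distribution: mean of the k-th smallest value across all columns.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[k] for sc in sorted_cols) / n_cols for k in range(n_rows)]

    # Replace each value by the reference value that has the same rank.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


# Toy matrix: 4 probes (rows) x 3 chips (columns); values are made up.
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(toy)
```

After normalization every chip has the same set of values, so systematic differences in overall intensity between chips no longer affect comparisons between conditions.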
EXPANDER – Preprocessing (III) • Filtering: focus downstream analysis on the set of “responding genes” • Fold change • Variation • Statistical tests (t-test) • Standardization: create a common scale • For each probe: mean = 0, STD = 1 • Log data (base 2) • Fixed norm (divide by the norm of the probe vector)
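The three standardization options can be sketched per probe as follows. This is an illustrative sketch only: the function names are hypothetical, the population standard deviation and the Euclidean norm are assumptions (the slide does not say which variants EXPANDER uses), and the log transform assumes strictly positive intensities.

```python
import math
import statistics


def standardize_mean0_std1(profile):
    """Shift and scale one probe's profile to mean 0 and standard deviation 1."""
    mean = statistics.fmean(profile)
    std = statistics.pstdev(profile)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in profile]  # a flat profile carries no pattern
    return [(x - mean) / std for x in profile]


def log2_transform(profile):
    """Take log base 2 of each value (values must be positive)."""
    return [math.log2(x) for x in profile]


def fixed_norm(profile):
    """Divide the profile by its Euclidean norm (assumed norm)."""
    norm = math.sqrt(sum(x * x for x in profile))
    return [x / norm for x in profile]


z = standardize_mean0_std1([1.0, 2.0, 3.0])
logged = log2_transform([1.0, 2.0, 8.0])
unit = fixed_norm([3.0, 4.0])
```

All three put probes with very different absolute intensities on a common scale, so that clustering compares the shape of the expression patterns and not their magnitude.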
Cluster Analysis • Partition the responding genes into distinct sets, each with a particular expression pattern • Identify major patterns in the data: reduce the dimensionality of the problem • co-expression → co-function • co-expression → co-regulation • Partition the genes to achieve: • Homogeneity: genes inside a cluster show highly similar expression pattern. • Separation: genes from different clusters have different expression patterns.
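One possible way to quantify the two goals on this slide is sketched below. It assumes Pearson correlation as the similarity measure and defines homogeneity as the mean similarity of gene pairs in the same cluster and separation as the mean similarity of gene pairs in different clusters; these definitions, the function names, and the toy profiles are assumptions for illustration, not taken from the slides.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def homogeneity_and_separation(profiles, labels):
    """Mean within-cluster and mean between-cluster pairwise correlation."""
    within, between = [], []
    for i, j in combinations(range(len(profiles)), 2):
        r = pearson(profiles[i], profiles[j])
        (within if labels[i] == labels[j] else between).append(r)
    return sum(within) / len(within), sum(between) / len(between)


# Toy data: two rising genes (cluster 0) and two falling genes (cluster 1).
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0], [6.0, 4.0, 1.5]]
labels = [0, 0, 1, 1]
homogeneity, separation = homogeneity_and_separation(profiles, labels)
```

Under these definitions a good partition has high homogeneity and low separation; here the within-cluster correlations are close to 1 and the between-cluster correlations are close to -1.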
Cluster Analysis (II) • Implemented algorithms: • CLICK, K-means, SOM, Hierarchical • Visualization: • Mean expression patterns • Heat-maps
Example study: responses to ionizing radiation • [Diagram labels: Ionizing radiation; Double-strand breaks; Sensors; ATM; Effectors (p53, BRCA1, CHK2); Survival pathways; Cell death pathways; Apoptosis; Cell cycle arrest; Stress responses; DNA repair]
Example study: experimental design • Genotypes: Atm-/- and wild-type (w.t.) control mice • Tissue: lymph node • Treatment: ionizing radiation • Time points: 0, 30 min, 120 min • Microarrays: Affymetrix U74Av2 (12k probesets)
Test case – Data Analysis • Dataset: six conditions (2 genotypes, 3 time points) • Normalization • Filtering step – define the ‘responding genes’ set • Genes whose expression level changes by at least 1.75-fold • Over 700 genes met this criterion • The set contains genes with various response patterns – we applied CLICK to this set of genes
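A fold-change filter of this kind can be sketched in a few lines. The slide does not say how the fold change is measured, so this example assumes linear-scale, positive intensities and takes the ratio of the highest to the lowest value across the conditions; the gene names and values are made up.

```python
def max_fold_change(profile):
    """Ratio of the largest to the smallest value in a profile (assumed definition)."""
    return max(profile) / min(profile)


def responding_genes(expression, threshold=1.75):
    """Keep genes whose expression changes by at least `threshold`-fold."""
    return {
        gene: profile
        for gene, profile in expression.items()
        if max_fold_change(profile) >= threshold
    }


# Hypothetical intensities for three genes across three conditions.
expression = {
    "geneA": [100.0, 180.0, 90.0],   # 2.0-fold: kept
    "geneB": [100.0, 120.0, 110.0],  # 1.2-fold: dropped
    "geneC": [200.0, 200.0, 350.0],  # exactly 1.75-fold: kept
}
kept = responding_genes(expression)
```

Only the genes that pass the filter are handed to the clustering step, which keeps flat, non-responding profiles from diluting the clusters.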
Major Gene Clusters – Irradiated Lymph Node • Atm-dependent early responding genes
Major Gene Clusters – Irradiated Lymph Node • Atm-dependent 2nd wave of responding genes
Ascribe Functional Meaning to the Clusters • Gene Ontology (GO) annotations for human, mouse, rat, chicken, fly, worm, Arabidopsis, zebrafish and yeast • TANGO: applies statistical tests that seek over-represented GO functional categories in the clusters
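The slide only says "statistical tests". A common choice for this kind of over-representation question is the hypergeometric tail probability, which is assumed in the sketch below; the function name and the toy numbers are hypothetical, and no correction for testing many categories is included.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) for X ~ Hypergeometric.

    population:   number of genes in the background set
    annotated:    background genes annotated with the GO category
    cluster_size: number of genes in the cluster
    hits:         cluster genes annotated with the category
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example: 10 background genes, 3 in the category, a cluster of 3 genes
# that are all in the category.
p_value = hypergeom_enrichment_pvalue(10, 3, 3, 3)
```

A small p-value means the category appears in the cluster more often than expected for a random gene set of the same size. In practice many categories are tested per cluster, so the raw p-values need a multiple-testing correction.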
Functional Categories • Cell cycle control (p < 1×10⁻⁶)
Functional Categories • Cell cycle control (p < 5×10⁻⁶) • Apoptosis (p = 0.001)
Identify Transcriptional Regulators • Clues are in the promoters • [Diagram: hidden layer with ATM, p53, TF-A, TF-B, TF-C and a NEW regulator, with unknown links marked “?”; observed layer with genes g1–g13]
‘Reverse engineering’ of transcriptional networks • Infers regulatory mechanisms from gene expression data • Assumption: co-expression → transcriptional co-regulation → common cis-regulatory promoter elements • Step 1: Identification of co-expressed genes using microarray technology (clustering algorithms) • Step 2: Computational identification of cis-regulatory elements that are over-represented in the promoters of the co-expressed genes
PRIMA – general description • Input: • Target set (e.g., co-expressed genes) • Background set (e.g., all genes on the chip) • Analysis: • Identify transcription factors whose binding site signatures are enriched in the ‘Target set’ with respect to the ‘Background set’ • TF binding site models – TRANSFAC DB • Default promoter region: from -1000 bp to +200 bp relative to the TSS
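The target-versus-background comparison can be illustrated with a toy position weight matrix (PWM) scan. This is a sketch of the general idea only and not PRIMA's scoring or statistics, which the slides do not describe: the PWM, the score threshold, the uniform background base frequencies, the single-strand scan, and the short "promoter" sequences are all made up for the example.

```python
import math

BASES = "ACGT"

# Hypothetical PWM for the motif TGCA: one row per position, columns A, C, G, T.
PWM = [
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
]


def pwm_score(pwm, site):
    """Log-odds score of a site against a uniform background (0.25 per base)."""
    return sum(math.log2(pwm[i][BASES.index(b)] / 0.25) for i, b in enumerate(site))


def has_hit(pwm, promoter, threshold):
    """True if any window of the promoter scores at or above the threshold."""
    width = len(pwm)
    return any(
        pwm_score(pwm, promoter[s:s + width]) >= threshold
        for s in range(len(promoter) - width + 1)
    )


def hit_fraction(pwm, promoters, threshold):
    """Fraction of promoters containing at least one hit."""
    return sum(has_hit(pwm, p, threshold) for p in promoters) / len(promoters)


target = ["AATGCATT", "CCTGCAGG", "ACGTACGT"]
background = target + ["GGGGCCCC", "ATATATAT", "TTTGCAAA"]
target_fraction = hit_fraction(PWM, target, threshold=5.0)
background_fraction = hit_fraction(PWM, background, threshold=5.0)
enrichment = target_fraction / background_fraction
```

A ratio above 1 indicates that the motif occurs more often in the target promoters than in the background. A real analysis would scan both strands over the full promoter region and attach a p-value to the observed enrichment.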
PRIMA – Results (transcription factor: enrichment factor, p-value) • NF-κB: 5.1, 3.8×10⁻⁸ • p53: 4.2, 9.6×10⁻⁷ • STAT-1: 3.2, 5.4×10⁻⁶ • Sp-1: 1.7, 6.5×10⁻⁴
Biclustering • Clustering becomes too restrictive on large datasets: • Seeks global partition of genes according to similarity in their expression across ALL conditions • Relevant knowledge can be revealed by identifying genes with common pattern across a subset of the conditions • Biclustering algorithmic approach
Biclustering: SAMBA (Statistical Algorithmic Method for Bicluster Analysis) • A. Tanay, R. Sharan, R. Shamir, RECOMB 02 • Bicluster (= module): a subset of genes with similar behavior in a subset of conditions • Computationally challenging: has to consider many combinations of condition subsets
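The following toy sketch illustrates the bicluster definition and why the search is expensive. It is not the SAMBA algorithm: it simply enumerates condition subsets by brute force and uses a made-up coherence rule (all genes within a fixed tolerance on every chosen condition); names, tolerance, and data are assumptions.

```python
from itertools import combinations


def is_coherent(profiles, conditions, tolerance=0.5):
    """True if all genes have nearly equal values on every chosen condition."""
    for c in conditions:
        values = [p[c] for p in profiles]
        if max(values) - min(values) > tolerance:
            return False
    return True


def coherent_condition_subsets(profiles, size, tolerance=0.5):
    """Brute force: all condition subsets of a given size on which genes agree."""
    n_conditions = len(profiles[0])
    return [
        subset
        for subset in combinations(range(n_conditions), size)
        if is_coherent(profiles, subset, tolerance)
    ]


# Three genes that agree on conditions 0, 1 and 4 but not on 2 and 3.
genes = [
    [2.0, 2.1, 0.1, -1.5, 2.0],
    [2.1, 2.0, -1.0, 1.2, 1.9],
    [1.9, 2.2, 1.4, 0.0, 2.1],
]
coherent_everywhere = is_coherent(genes, range(5))
found = coherent_condition_subsets(genes, 3)

# Number of non-empty condition subsets a brute-force search would face.
n_condition_subsets = 2 ** len(genes[0]) - 1
```

The three genes do not look alike across all five conditions, so a standard clustering would not necessarily group them, yet they form a bicluster on conditions 0, 1 and 4. The number of condition subsets doubles with every added condition (and gene subsets multiply it further), which is why exhaustive enumeration stops being practical beyond toy examples.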