APO-SYS workshop on data analysis and pathway charting

Igor Ulitsky, Ron Shamir’s Computational Genomics Group

Part I: Presentations • EXPANDER • AMADEUS • SPIKE • MATISSE

Part II: Hands-on Session • EXPANDER • MATISSE • SPIKE

EXPression ANalyzer and DisplayER
Adi Maron-Katz, Chaim Linhart, Amos Tanay, Rani Elkon, Israel Steinfeld, Seagull Shavit, Igor Ulitsky, Roded Sharan, Yossi Shiloh, Ron Shamir
http://acgt.cs.tau.ac.il/expander

EXPANDER
• Low level analysis:
  • Missing data estimation (KNN or manual)
  • Normalization: quantile, loess
  • Filtering: fold change, variation, t-test
  • Standardization: mean 0 std 1, take log, fixed norm
• High level gene partition analysis:
  • Clustering
  • Biclustering
• Ascribing biological meaning to patterns:
  • Enriched functional categories (Gene Ontology)
  • Identify transcriptional regulators – promoter analysis
• Built-in support for 9 organisms:
  • human, mouse, rat, chicken, zebrafish, fly, worm, Arabidopsis, yeast

[Figure: EXPANDER components. Input data; Normalization/Filtering; Clustering (CLICK, SOM, K-means, Hierarchical); Biclustering (SAMBA); Functional enrichment (TANGO); Promoter signals (PRIMA); Visualization utilities; Links to public annotation databases]

EXPANDER – Preprocessing
• Input data:
  • Expression matrix (probe-row; condition-column)
  • One-channel data (e.g., Affymetrix)
  • Dual-channel data (cDNA microarrays; data are (log) ratios between the Red and Green channels)
  • ‘.cel’ files
  • ID conversion file: map probes to genes
  • Gene sets data
• Data definitions:
  • Defining condition subsets
  • Data type & scale (log)

EXPANDER – Preprocessing (II)
• Data adjustments:
  • Missing value estimation (KNN or arbitrary)
  • Merging conditions
• Normalization: removal of systematic biases from the analyzed chips
  • Implemented methods: quantile, lowess
  • Visualization: box plots, scatter plots (simple, M vs. A)
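As an illustration of the quantile method named above, here is a minimal sketch in plain Python. It is not EXPANDER's implementation: the function name `quantile_normalize`, the toy `chips` matrix, and the tie-breaking by row order are all assumptions made for this example.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (conditions) of a probes x conditions matrix.

    Every column ends up with the same value distribution. Ties are broken
    by row order, which is a simplification of this sketch.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # For each column, the row indices sorted by expression value.
    order = [sorted(range(n_rows), key=lambda r: matrix[r][c]) for c in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value across columns.
    reference = [
        sum(matrix[order[c][k]][c] for c in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(order[c]):
            # The probe ranked k-th in this column receives the k-th reference value.
            normalized[r][c] = reference[k]
    return normalized


# Toy data: 4 probes (rows) measured on 3 chips (columns).
chips = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(chips)
```

After the call, each column of `normalized` contains the same set of values, so chip-wide intensity differences no longer separate the conditions.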

EXPANDER – Preprocessing (III)
• Filtering: focus downstream analysis on the set of “responding genes”
  • Fold-change
  • Variation
  • Statistical tests (t-test)
• Standardization: create a common scale
  • For each probe: mean = 0, STD = 1
  • Log data (base 2)
  • Fixed norm (divide by norm of probe vector)
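The filtering and standardization options above can be sketched as follows. This is an illustrative example only: defining fold change as the max/min ratio on linear-scale data, using the population standard deviation, and all names (`fold_change`, `standardize`, `fixed_norm`, `probe_a`, `probe_b`) are assumptions, not EXPANDER's actual definitions.

```python
import math


def fold_change(profile):
    """Ratio of highest to lowest level (assumes linear-scale, positive values)."""
    return max(profile) / min(profile)


def standardize(profile):
    """Shift and scale a probe profile to mean 0 and standard deviation 1."""
    n = len(profile)
    mean = sum(profile) / n
    # Population standard deviation; a sample estimate would divide by n - 1.
    std = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    return [(x - mean) / std for x in profile]


def fixed_norm(profile):
    """Divide a probe profile by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in profile))
    return [x / norm for x in profile]


# Hypothetical probes measured in three conditions.
expression = {
    "probe_a": [100.0, 210.0, 400.0],  # changes 4-fold: kept
    "probe_b": [100.0, 105.0, 98.0],   # nearly flat: filtered out
}
THRESHOLD = 1.75
responding = {p: v for p, v in expression.items() if fold_change(v) >= THRESHOLD}
standardized = {p: standardize(v) for p, v in responding.items()}
```

Standardizing after filtering puts the remaining profiles on a common scale, so that clustering compares the shape of the response rather than its absolute magnitude.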

Cluster Analysis
• Partition the responding genes into distinct sets, each with a particular expression pattern
• Identify major patterns in the data: reduce the dimensionality of the problem
• co-expression → co-function
• co-expression → co-regulation
• Partition the genes to achieve:
  • Homogeneity: genes inside a cluster show highly similar expression patterns
  • Separation: genes from different clusters have different expression patterns
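One way to make the two goals concrete is to average a similarity measure over same-cluster gene pairs (homogeneity) and over different-cluster pairs (separation). The sketch below assumes Pearson correlation as the similarity; the function names, the toy profiles, and the cluster labels are hypothetical and do not reproduce any specific algorithm in EXPANDER.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def homogeneity_and_separation(profiles, labels):
    """Mean similarity of same-cluster pairs and of different-cluster pairs."""
    within, between = [], []
    for g1, g2 in combinations(profiles, 2):
        r = pearson(profiles[g1], profiles[g2])
        (within if labels[g1] == labels[g2] else between).append(r)
    return sum(within) / len(within), sum(between) / len(between)


# Toy data: two induced genes and two repressed genes.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.1, 6.0],
    "g3": [3.0, 2.0, 1.0],
    "g4": [6.0, 3.9, 2.0],
}
labels = {"g1": "up", "g2": "up", "g3": "down", "g4": "down"}
homogeneity, separation = homogeneity_and_separation(profiles, labels)
```

A good partition has high within-cluster similarity and low between-cluster similarity; here the first value is close to 1 and the second close to -1.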

Cluster Analysis (II)
• Implemented algorithms:
  • CLICK, K-means, SOM, Hierarchical
• Visualization:
  • Mean expression patterns
  • Heat-maps

Example study: responses to ionizing radiation
[Figure: Ionizing Radiation; Double Strand Breaks; Sensors; ATM; Effectors (p53, BRCA1, CHK2); Cell cycle arrest; Stress responses; DNA repair; Survival pathways; Cell death pathways; Apoptosis]

Example study: experimental design
• Genotypes: Atm-/- and control w.t. mice
• Tissue: Lymph node
• Treatment: Ionizing radiation
• Time points: 0, 30 min, 120 min
• Microarrays: Affymetrix U74Av2 (12k probesets)

Test case – Data Analysis
• Dataset: six conditions (2 genotypes, 3 time points)
• Normalization
• Filtering step – define the ‘responding genes’ set
  • Genes whose expression level changed by at least 1.75-fold
  • Over 700 genes met this criterion
• The set contains genes with various response patterns – we applied CLICK to this set of genes

Major Gene Clusters – Irradiated Lymph node Atm-dependent early responding genes

Major Gene Clusters – Irradiated Lymph node Atm-dependent 2nd wave of responding genes

Ascribe Functional Meaning to the Clusters
• Gene Ontology (GO) annotations for human, mouse, rat, chicken, fly, worm, Arabidopsis, zebrafish and yeast
• TANGO: apply statistical tests that seek over-represented GO functional categories in the clusters
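A common way to test whether a category is over-represented in a cluster is the hypergeometric tail probability. The sketch below shows that test only; it is an assumption that this is a fair stand-in for the idea, and it does not reproduce TANGO's actual procedure (for example, any multiple-testing correction). The function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_tail(population, annotated, cluster_size, overlap):
    """P(X >= overlap) for a cluster drawn at random without replacement.

    population:   number of genes in the background set
    annotated:    background genes carrying the functional category
    cluster_size: number of genes in the cluster
    overlap:      cluster genes carrying the category
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 20 background genes, 5 annotated; a 4-gene cluster holds 3 of them.
p_value = hypergeom_tail(population=20, annotated=5, cluster_size=4, overlap=3)
```

A small `p_value` means that finding this many annotated genes in a random cluster of the same size would be unlikely.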

Functional Enrichment - Visualization

Functional Categories: cell cycle control (p < 1×10^-6)

Functional Categories: cell cycle control (p < 5×10^-6), apoptosis (p = 0.001)

Identify Transcriptional Regulators: clues are in the promoters
[Figure: hidden layer of regulators (ATM, p53, TF-A, TF-B, TF-C, and unknown nodes marked “?” or “NEW”) above an observed layer of genes g1 to g13]

‘Reverse engineering’ of transcriptional networks
• Infers regulatory mechanisms from gene expression data
• Assumption: co-expression → transcriptional co-regulation → common cis-regulatory promoter elements
• Step 1: Identification of co-expressed genes using microarray technology (clustering algorithms)
• Step 2: Computational identification of cis-regulatory elements that are over-represented in promoters of the co-expressed genes

PRIMA – general description
• Input:
  • Target set (e.g., co-expressed genes)
  • Background set (e.g., all genes on the chip)
• Analysis:
  • Identify transcription factors whose binding site signatures are enriched in the ‘Target set’ with respect to the ‘Background set’
  • TF binding site models – TRANSFAC DB
  • Default promoter region: from -1000 bp to +200 bp relative to the TSS
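To illustrate what scanning a promoter with a binding site model involves, here is a minimal position weight matrix (PWM) sketch. Everything in it is an assumption for illustration: the 4-position count matrix is invented (not taken from TRANSFAC), base frequencies are taken as uniform, only one strand is scanned, and the scoring scheme is a generic log-odds score rather than PRIMA's own.

```python
import math

# Hypothetical count matrix for an invented 4-bp site (consensus AGCA).
COUNTS = {
    "A": [8, 1, 1, 7],
    "C": [1, 1, 7, 1],
    "G": [1, 7, 1, 1],
    "T": [0, 1, 1, 1],
}
BACKGROUND = 0.25   # uniform base frequency, an assumption
PSEUDOCOUNT = 1.0   # avoids log(0) for bases never observed at a position


def log_odds_matrix(counts):
    """Convert per-position base counts into log2 odds against the background."""
    length = len(counts["A"])
    matrix = []
    for i in range(length):
        column_total = sum(counts[b][i] + PSEUDOCOUNT for b in "ACGT")
        matrix.append({
            b: math.log2((counts[b][i] + PSEUDOCOUNT) / column_total / BACKGROUND)
            for b in "ACGT"
        })
    return matrix


def best_hit(sequence, matrix):
    """Return (score, offset) of the highest-scoring window on the given strand."""
    width = len(matrix)
    scores = [
        (sum(matrix[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)


pwm = log_odds_matrix(COUNTS)
promoter = "TTTAGCATTT"  # toy promoter sequence
score, offset = best_hit(promoter, pwm)
```

In an enrichment analysis, hits above a score threshold would be counted per promoter in the target and background sets, and the two counts compared statistically.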

Promoter Analysis - Visualization

PRIMA - Results

PRIMA – Results NF-B 5.1 3.8x10-8 p53 4.2 9.6x10-7 STAT-1 3.2 5.4x10-6 Sp-1 1.7 6.5x10-4

Biclustering
• Clustering becomes too restrictive on large datasets:
  • Seeks global partition of genes according to similarity in their expression across ALL conditions
• Relevant knowledge can be revealed by identifying genes with a common pattern across a subset of the conditions
• Biclustering algorithmic approach

Biclustering: SAMBA (Statistical-Algorithmic Method for Bicluster Analysis)
A. Tanay, R. Sharan, R. Shamir, RECOMB 02
• Bicluster (= module): subset of genes with similar behavior in a subset of conditions
• Computationally challenging: has to consider many combinations of sub-conditions
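The motivation for biclusters can be seen on a toy pair of genes that behave alike in only some conditions. The sketch below does not implement SAMBA; it only compares Pearson correlation over all conditions with correlation over a chosen subset. The profiles `gene_a` and `gene_b` and the subset are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Two hypothetical genes over 8 conditions: similar in the first four only.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 5.0, 2.0]
gene_b = [1.1, 2.0, 2.9, 4.2, 1.0, 5.0, 2.0, 4.0]
subset = [0, 1, 2, 3]  # conditions in which the genes are co-expressed

r_all = pearson(gene_a, gene_b)
r_subset = pearson([gene_a[i] for i in subset], [gene_b[i] for i in subset])
```

Over all eight conditions the correlation is weak, so a global clustering would likely separate the two genes, while over the four-condition subset it is close to 1. Finding such gene and condition subsets without knowing them in advance is what makes the problem computationally challenging.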

Biclustering Visualization
