
Context-Specific Bayesian Clustering for Gene Expression Data



Presentation Transcript


  1. Context-Specific Bayesian Clustering for Gene Expression Data. Yoseph Barash, Nir Friedman. School of Computer Science & Engineering, Hebrew University.

  2. Introduction • New experimental methods → abundance of data • Gene expression • Genomic sequences • Protein levels • … • Data analysis methods are crucial for understanding such data • Clustering serves as a tool for organizing the data and finding patterns in it

  3. This Talk • New method for clustering • Combines different types of data • Emphasis on learning a context-specific description of the clusters • Application to gene expression data • Combines expression data with genomic information

  4. The Data • Microarray data: a genes × experiments matrix, where entry (i, j) is the mRNA level of gene i in experiment j • Genomic data: a genes × binding-sites matrix, where entry (i, j) is the number of binding sites of TF j in the promoter region of gene i • Goal: understand the interactions between TFs and expression levels
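
To make the setup concrete, here is a minimal sketch of the two matrices in Python. The shapes and the random generators are illustrative assumptions, not values from the talk (though the yeast dataset later in the talk has 93 arrays and ~900 selected genes).

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_experiments, n_tfs = 900, 93, 5   # illustrative sizes only

# expression[i, j]: mRNA level (e.g. a log-ratio) of gene i in experiment j
expression = rng.normal(size=(n_genes, n_experiments))

# binding[i, j]: number of binding sites of TF j in the promoter region of gene i
binding = rng.poisson(lam=0.5, size=(n_genes, n_tfs))
```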

  5. Simple Clustering Model [Figure: naive Bayes network, a Cluster node with children TF1, TF2, TF3, …, TFk and A1, A2, A3, …, An] • Attributes are independent given the cluster • Simple model → computationally cheap • Genes are clustered according to both expression levels and binding sites
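
Given the cluster, every attribute is independent, so the log-likelihood of a gene under a cluster is just a sum of per-attribute terms, and the cluster posterior follows from Bayes' rule. A minimal sketch; `log_p_attr` is a hypothetical callable mapping (attribute index, value, cluster) to a log-probability, not part of the authors' code.

```python
import numpy as np

def gene_log_likelihood(gene_attrs, k, log_p_attr):
    """log P(gene | cluster=k): a sum over attributes, by the independence assumption."""
    return sum(log_p_attr(a, value, k) for a, value in enumerate(gene_attrs))

def posterior_over_clusters(gene_attrs, log_prior, log_p_attr):
    """P(cluster | gene) via Bayes' rule, computed in log space for stability."""
    logp = np.array([log_prior[k] + gene_log_likelihood(gene_attrs, k, log_p_attr)
                     for k in range(len(log_prior))])
    logp -= logp.max()              # shift to avoid underflow when exponentiating
    p = np.exp(logp)
    return p / p.sum()
```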

  6. Local Probability Models [Figure: the Cluster node with children TF1, TF2, modeled as multinomials, and A1, A2, modeled as Gaussians]
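
A hedged sketch of the two local model families named on the slide: a Gaussian density for a continuous expression attribute and a multinomial (categorical) distribution for a discretized binding-site count. Parameter names are illustrative.

```python
import numpy as np

# Gaussian local model for a continuous expression attribute
def log_p_expression(x, mu, sigma):
    """log N(x; mu, sigma^2)."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

# Multinomial (categorical) local model for a discrete binding-site count
def log_p_binding(count, theta):
    """log theta[count], with theta a probability vector over count values."""
    return np.log(theta[count])
```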

  7. Structure in Local Probability Models [Figure: the same network as above, with local structure added to each conditional distribution]

  8. Context-Specific Independence [Figure: each attribute is annotated with the subset of cluster values on which its distribution depends, e.g. E1: {1,2,3,4,5}, E2: {2,4}, TF1: {}, TF2: {1,2,4}] Benefits: • Identifies which features characterize each cluster • Reduces bias during learning • A compact and efficient representation (see the sketch below)
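
One way to realize this representation in code: each attribute stores the subset of cluster values that get their own parameters and falls back to shared "default" parameters for every other cluster. A minimal sketch; all names here are assumptions for illustration, not the paper's implementation.

```python
class CSIAttribute:
    """An attribute with context-specific independence: cluster-specific
    parameters only for the clusters in `active`, a shared default otherwise."""

    def __init__(self, active_clusters, cluster_params, default_params):
        self.active = set(active_clusters)   # e.g. {2, 4} as on the slide
        self.params = cluster_params         # per-cluster parameters, keyed by cluster
        self.default = default_params        # shared parameters for all other clusters

    def params_for(self, k):
        return self.params[k] if k in self.active else self.default

# E.g. an attribute with local structure {2, 4}: clusters 2 and 4 have their own
# multinomial, every other cluster shares the default one.
tf = CSIAttribute({2, 4}, {2: [0.1, 0.9], 4: [0.7, 0.3]}, [0.5, 0.5])
assert tf.params_for(4) == [0.7, 0.3]
assert tf.params_for(1) == [0.5, 0.5]   # falls back to the shared default
```

An empty subset, like TF2's {} above, means the attribute is independent of the cluster altogether, which is exactly how the model identifies features that do not characterize any cluster.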

  9. Scoring CSI Cluster Models • Represent conditional probabilities with different parametric families: Gaussian, multinomial, Poisson, … • Choose parameter priors from the appropriate conjugate prior families • Score: Score(M : D) = log P(D | M) + log P(M), where P(D | M) is the marginal likelihood and P(M) is the structure prior
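
As one concrete building block, conjugacy gives the multinomial families a closed-form marginal likelihood. A sketch assuming a symmetric Dirichlet(alpha) prior; the hyperparameter value is an illustrative choice.

```python
from math import lgamma

def log_marginal_dirichlet_multinomial(counts, alpha):
    """log P(counts | M) with the multinomial parameters integrated out under a
    symmetric Dirichlet(alpha) prior (up to the multinomial coefficient, as in
    standard Bayesian-network scores)."""
    n = sum(counts)
    a0 = alpha * len(counts)
    res = lgamma(a0) - lgamma(a0 + n)
    for c in counts:
        res += lgamma(alpha + c) - lgamma(alpha)
    return res

# e.g. the contribution of one multinomial family that observed counts [12, 3, 5]:
print(log_marginal_dirichlet_multinomial([12, 3, 5], alpha=1.0))
```

The full score sums such per-family terms (Gaussian and Poisson families have analogous closed forms under their conjugate priors) and adds log P(M).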

  10. Learning Structure – Naive Approach • A hard problem • "Standard" approach: try "nearby" structures, learn parameters for each one using EM, and choose the best structure [Figure: several candidate local structures branching off the current one] • Basic problem: efficiency – a full EM run is needed for every candidate structure (see the sketch below)
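
A sketch of this naive greedy search, to make the efficiency problem explicit: every candidate neighbor triggers a full EM run before it can even be scored. `neighbors`, `run_em`, and `score` are hypothetical helpers, not the authors' API.

```python
def naive_structure_search(initial_structure, data, neighbors, run_em, score):
    """Greedy hill-climbing over structures; a full EM per candidate -- expensive."""
    current = initial_structure
    current_score = score(run_em(current, data), data)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):
            params = run_em(candidate, data)   # the costly step, repeated per candidate
            s = score(params, data)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:                       # no neighbor improves: local optimum
            return current
        current, current_score = best, best_score
```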

  11. Learning Structure – Structural EM • Soft assignment for genes: learn model parameters using EM, then compute expected sufficient statistics • Use the "completed" data to evaluate each edge separately and find the best model • Given completed data, each edge's parameters can be evaluated separately, so only one EM run (for the MAP parameters) is needed per iteration • Guaranteed to converge to a local optimum
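
A hedged sketch of the loop as the slide describes it: one EM run per iteration produces soft assignments, expected sufficient statistics are computed once, and every candidate edge (each attribute's local structure) is then scored on those statistics without further EM. Helper names are assumptions for illustration.

```python
def structural_em(structure, data, run_em, expected_stats, best_local_structure,
                  max_iters=20):
    """Structural EM: alternate one EM run with independent per-edge structure updates."""
    for _ in range(max_iters):
        params, posteriors = run_em(structure, data)    # one EM run per iteration
        stats = expected_stats(posteriors, data)        # expected sufficient statistics
        new_structure = {attr: best_local_structure(attr, stats)
                         for attr in structure}         # each edge evaluated separately
        if new_structure == structure:                  # no change: local optimum reached
            return structure
        structure = new_structure
    return structure
```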

  12. Results on Synthetic Data Basic approach: • Generate data from a known structure • Evaluate learned structures for different sample sizes (200–800) • Add "noise" of unrelated samples to the training set (10–30%) to simulate genes that do not fall into "nice" functional categories • Test the learned model for structure recovery as well as for the correlation between its cluster assignments and those given by the original model Main results: • Cluster number: models with fewer clusters were sharply penalized; models with 1–2 additional clusters often got a similar score, with "degenerate" clusters to which none of the real samples were assigned • Structure accuracy: very few false-negative edges, 10–20% false-positive edges (score dependent) • Mutual information ratio: maximal for 800 samples, 95–100% for 500, and ~90% for 200 samples • Overall, the learned clusters were very informative

  13. Yeast Stress Data (Gasch et al. 2001) • Examines the response of yeast to stress situations • 93 arrays in total • We selected ~900 genes that changed in a selective manner Treatment steps: • Initial clustering • Found putative binding sites based on the clusters • Re-clustered with these sites

  14. Stress Data – CSI Clusters [figure slide]

  15. CSI Clusters [Figure: mean expression level (−2 to 4) per cluster across conditions: Nitrogen Depletion, HSF variable, Diauxic shift, Menadione, Starvation, diamide, sorbitol, Steady state, H2O2, DDT, YPD, HSF, YP]

  16. Promoter Analysis – Cluster 3 [Figure: mean expression profile of cluster 3 across the same stress conditions] • MIG1: CCCCGC, CGGACC, ACCCCG • GAL4: CGGGCC • Others: CCAATCA

  17. Promoter Analysis – Cluster 7 [Figure: mean expression profile of cluster 7 across the same stress conditions] • GCN4: TGACTCA • Others: CGGAAAA, ACTGTGG

  18. Discussion Goals: • Identify binding sites / transcription factors • Understand interactions among transcription factors • "Combinatorial effects" on expression • Predict the role/function of the genes Methods: • Integration of models of statistical patterns of binding sites (see Holmes & Bruno, ISMB '00) • Additional dependencies among attributes • Tree-augmented naive Bayes • Probabilistic relational models (see poster)
