In previous lectures:
-- Identifying differentially expressed genes from replicates
   Parametric tests:
   -- T-test (are the means of 2 samples different?)
   -- ANOVA (are the means of 2 or more samples different?)
   Bayesian methods: baySeq
-- Bonferroni vs. q-value & other FDR corrections for multiple testing
   FDR (false discovery rate): gives the % of false positives in the SET of called genes
-- Defining sensitivity and specificity
baySeq issues??

Now you have selected a subset of genes to focus on … But even then, there is often still an overwhelming amount of data. Need some strategies to simplify the analysis & visualization

One Strategy: focus initially on groups of genes

          Array 1   Array 2   Array 3
Gene X:   X1        X2        X3

The three measurements for Gene X can be treated as the x, y, and z coordinates of a point in space.

Practically speaking, the Pearson correlation R is the sum of all pairwise comparisons of the gene expression values in two gene expression vectors.

Standard Pearson Correlation:

$$R_{x,y} = \frac{1}{N} \sum_{i=1}^{N} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{SD_x \, SD_y}$$

          Array 1   Array 2   Array 3   Array 4   Array 5
Gene X:   X1        X2        X3        X4        X5
Gene Y:   Y1        Y2        Y3        Y4        Y5

Pearson correlation ranges from -1 (anticorrelated) through 0 (uncorrelated) to 1 (identical).
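As a rough illustration of the standard Pearson formula, here is a minimal Python sketch. The function name `pearson` and the example expression values are made up for illustration; the standard deviations are computed with 1/N so that they match the 1/N in front of the sum.

```python
import math

def pearson(x, y):
    """Standard (centered) Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Standard deviations computed with 1/N, matching the 1/N in the formula
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    # Average of the pairwise products of the centered values
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    return cov / (sd_x * sd_y)

# Hypothetical expression values across 5 arrays
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.0, 4.0, 6.0, 8.0, 10.0]   # same pattern as gene_x, scaled
gene_z = [5.0, 4.0, 3.0, 2.0, 1.0]    # opposite pattern

r_xy = pearson(gene_x, gene_y)   # close to +1
r_xz = pearson(gene_x, gene_z)   # close to -1
print(r_xy, r_xz)
```

Note that the sketch divides by zero if a gene has the same value on every array (SD of 0), so flat genes would need to be filtered out first.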

Uncentered Pearson Correlation (set the means of X and Y to 0):

$$R_{x,y} = \frac{\frac{1}{N} \sum_{i=1}^{N} X_i Y_i}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} X_i^2} \, \sqrt{\frac{1}{N} \sum_{i=1}^{N} Y_i^2}}$$

          Array 1   Array 2   Array 3   Array 4   Array 5
Gene X:   X1        X2        X3        X4        X5
Gene Y:   Y1        Y2        Y3        Y4        Y5

Using Standard Pearson Correlation: same pattern + constant offset gives a Pearson correlation of 1.0
Using Uncentered Pearson Correlation: same pattern + constant offset does not give 1.0
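To see the effect of a constant offset, the two measures can be compared side by side. This is only a sketch: the function names and the example values are hypothetical, and the 1/N factors of the uncentered formula are dropped in the code because they cancel between numerator and denominator.

```python
import math

def pearson(x, y):
    """Standard (centered) Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n * sd_x * sd_y)

def uncentered_pearson(x, y):
    """Uncentered Pearson correlation: the means are treated as 0."""
    # The 1/N factors cancel, so they are omitted here
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return num / den

# Hypothetical data: gene_y has the same pattern as gene_x plus a constant offset
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [xi + 10.0 for xi in gene_x]

r_centered = pearson(gene_x, gene_y)               # offset is removed by centering
r_uncentered = uncentered_pearson(gene_x, gene_y)  # offset lowers the score
print(r_centered, r_uncentered)
```

With these values the centered correlation comes out at 1.0 (up to rounding) while the uncentered one is noticeably lower, which is the behavior described above. A pure rescaling without an offset (for example, doubling every value) still gives an uncentered correlation of 1.0.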

Sometimes you want to use the weighted Pearson correlation:

$$P_{x,y} = \frac{\frac{1}{N} \sum_{i=1}^{N} X_i Y_i}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} X_i^2} \, \sqrt{\frac{1}{N} \sum_{i=1}^{N} Y_i^2}}$$

          Array 1   Array 2   Array 3   Array 4   Array 5
Gene X:   X1        X2        X3        X4        X5
Gene Y:   Y1        Y2        Y3        Y4        Y5

For example: if three of these arrays are identical, those data are over-represented 3X, and down-weighting them corrects for this.
You will experiment with this in lab.
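One way to sketch the weighting idea in Python is shown below. This is an assumption, not necessarily the scheme used in lab: each array i gets a hypothetical weight w_i that multiplies its term in every sum of the uncentered formula, and three identical arrays each get weight 1/3 so that together they count as one array.

```python
import math

def weighted_uncentered_pearson(x, y, w):
    """Uncentered Pearson correlation with one weight per array (assumed scheme)."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den_x = math.sqrt(sum(wi * xi * xi for xi, wi in zip(x, w)))
    den_y = math.sqrt(sum(wi * yi * yi for yi, wi in zip(y, w)))
    return num / (den_x * den_y)

# Hypothetical data: arrays 1-3 are identical, arrays 4 and 5 are different conditions
gene_x = [2.0, 2.0, 2.0, -1.0, 0.5]
gene_y = [1.0, 1.0, 1.0, 3.0, -2.0]

equal_weights = [1.0, 1.0, 1.0, 1.0, 1.0]
down_weighted = [1 / 3, 1 / 3, 1 / 3, 1.0, 1.0]

r_equal = weighted_uncentered_pearson(gene_x, gene_y, equal_weights)
r_weighted = weighted_uncentered_pearson(gene_x, gene_y, down_weighted)

# Reference: the same data with the three identical arrays collapsed into one
r_collapsed = weighted_uncentered_pearson(
    [2.0, -1.0, 0.5], [1.0, 3.0, -2.0], [1.0, 1.0, 1.0]
)
print(r_equal, r_weighted, r_collapsed)
```

In this made-up example the over-represented arrays dominate the unweighted score (positive correlation), while the down-weighted score matches the collapsed data (negative correlation).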

Excellent review by J. Quackenbush, 2001, Nature Reviews Genetics

Hierarchical clustering

Goal is to organize the entire dataset into one hierarchical arrangement. Known as a “bottom up” or agglomerative clustering method.

Two parts:
1) Calculating gene similarity
2) Organizing genes such that similarly expressed genes are grouped together

Two steps of hierarchical clustering

1. Calculating the similarity matrix

Calculate the Pearson correlation for every pair of genes. You end up with a symmetrical table of Pearson correlations.
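A minimal sketch of building the similarity table, assuming the standard Pearson correlation as the similarity measure. The gene names and expression values are invented, and the table is stored as a plain nested dictionary for readability.

```python
import math

def pearson(x, y):
    """Standard (centered) Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n * sd_x * sd_y)

def similarity_matrix(genes):
    """Pearson correlation for every pair of genes, as a nested dict."""
    names = list(genes)
    return {a: {b: pearson(genes[a], genes[b]) for b in names} for a in names}

# Hypothetical expression vectors (4 arrays per gene)
expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.0, 6.0, 9.0],
    "gene3": [4.0, 3.0, 2.0, 1.0],
}

sim = similarity_matrix(expression)
for name, row in sim.items():
    print(name, [round(r, 3) for r in row.values()])
```

The table is symmetrical (the entry for gene1 vs. gene2 equals the entry for gene2 vs. gene1) and has 1.0 on the diagonal, so in practice only one triangle needs to be computed.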

Find the largest Pearson correlation in the table & join those genes together on a node (in the example: Gene 2 and Gene 5).
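Finding the pair to join amounts to scanning the off-diagonal entries of the similarity table for the maximum. The sketch below assumes the table is a nested dictionary; the function name and the correlation values are hypothetical.

```python
def most_similar_pair(sim):
    """Return the pair of distinct genes with the largest correlation."""
    names = list(sim)
    best_pair, best_r = None, -2.0  # any real correlation is >= -1
    for i, a in enumerate(names):
        for b in names[i + 1:]:     # upper triangle only: the table is symmetrical
            if sim[a][b] > best_r:
                best_pair, best_r = (a, b), sim[a][b]
    return best_pair, best_r

# Hypothetical similarity table
sim = {
    "gene1": {"gene1": 1.00, "gene2": 0.20, "gene5": 0.10},
    "gene2": {"gene1": 0.20, "gene2": 1.00, "gene5": 0.95},
    "gene5": {"gene1": 0.10, "gene2": 0.95, "gene5": 1.00},
}

pair, r = most_similar_pair(sim)
print(pair, r)
```

The diagonal is skipped on purpose, since every gene correlates perfectly with itself.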

Once Gene 2 and Gene 5 are joined: should Gene 10 get added onto this node?

4. Centroid linkage clustering

Each node is represented by its ‘centroid’ (the average vector of its member genes), and other genes or nodes are compared against that centroid.
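The whole bottom-up procedure with centroid linkage can be sketched in a few lines of Python. This is an illustrative toy, not an optimized implementation: it assumes the standard Pearson correlation as the similarity, takes the centroid as the plain average over all member genes, and recomputes every comparison at each step. Gene names and values are invented.

```python
import math

def pearson(x, y):
    """Standard (centered) Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n * sd_x * sd_y)

def centroid(vectors):
    """Average vector of a list of equal-length expression vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def centroid_linkage(genes):
    """Agglomerative clustering; returns the tree as nested tuples."""
    # Each cluster is (tree, list of member vectors); start with one gene each
    clusters = [(name, [vec]) for name, vec in genes.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                r = pearson(centroid(clusters[i][1]), centroid(clusters[j][1]))
                if best is None or r > best[0]:
                    best = (r, i, j)
        _, i, j = best
        # Join the two most similar clusters on a new node
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical expression vectors
expression = {
    "gene2": [1.0, 2.0, 3.0, 4.0],
    "gene5": [2.0, 4.0, 6.0, 8.5],
    "gene10": [1.0, 3.0, 2.0, 4.0],
    "gene7": [4.0, 3.0, 2.0, 1.0],
}

tree = centroid_linkage(expression)
print(tree)
```

With these values gene2 and gene5 are joined first, gene10 is then added to that node because it correlates best with the node's centroid, and the anticorrelated gene7 joins last.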

Visualization: Data are often converted to a colorimetric scale
-- Each box: a transcript measurement
-- Each row of boxes: transcript measurements for a given gene
-- Each column of boxes: transcript measurements from a single array
-- Red: higher transcript abundance in one sample
-- Green: higher transcript abundance in the other sample
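The conversion to a color scale can be sketched as a simple mapping from a measurement to an RGB triple. Everything specific here is an assumption for illustration: the input is taken to be a log2 ratio between the two samples, the color saturates at a log2 ratio of 3, and equal abundance is drawn as black.

```python
def ratio_to_rgb(log2_ratio, saturation=3.0):
    """Map a log2 expression ratio to a red/green color (assumed scale)."""
    # Clip extreme values so that a few outliers do not wash out the picture
    clipped = max(-saturation, min(saturation, log2_ratio))
    intensity = round(255 * abs(clipped) / saturation)
    if clipped > 0:
        return (intensity, 0, 0)   # red: higher in one sample
    if clipped < 0:
        return (0, intensity, 0)   # green: higher in the other sample
    return (0, 0, 0)               # black: no difference (assumed convention)

# One hypothetical row of boxes: one gene measured on four arrays
row = [3.0, 0.0, -3.0, 10.0]
colors = [ratio_to_rgb(v) for v in row]
print(colors)
```

A blue/yellow version would use the same mapping with different color channels.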

[Figure: Unweighted Pearson correlation, shown in a red/green version and a blue/yellow version]
