Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004

Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004 Jo Hardin Pomona College jo.hardin@pomona.edu

Variability • key to statistics • within slide vs. between slide • replication • red (Cy5) > green dye (Cy3): dye swap • log (base 2) transformation
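A small sketch of why the log (base 2) transformation is convenient: a two-fold increase and a two-fold decrease become +1 and -1, so up- and down-regulation are symmetric around zero. The function name and the intensity values are made up for illustration.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return math.log2(red / green)

# Hypothetical Cy5 (red) and Cy3 (green) intensities for two spots.
two_fold_up = log2_ratio(2000.0, 1000.0)    # raw ratio 2.0
two_fold_down = log2_ratio(1000.0, 2000.0)  # raw ratio 0.5
print(two_fold_up, two_fold_down)
```

On the raw scale the two ratios (2.0 and 0.5) sit at unequal distances from 1; on the log scale they sit at equal distances from 0.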

Example: • “Variation in Gene Expression Patterns in Follicular Lymphoma and the Response to Rituximab,” by Bohen, Troyanskaya, Alter, Warnke, Botstein, Brown, and Levy • 2 groups: patients who responded to treatment and patients who did not. • Cy5 dye used on malignant lymphoid tissues, Cy3 dye used on mRNA derived from cell lines • Biopsies obtained before treatment with Rituximab • Are there differences in gene expression between those who responded to treatment and those who didn’t?

Data Cleaning • Individual points were median centered for each cDNA clone and filtered for data quality. • Data values are either:

The Data:

Differential Expression Across Two Groups • Fold Change • t-test • Wilcoxon Rank-Sum Test • SAM

Fold Change • Of mean? Of median? • Across treatment groups? vs. reference group? • Small vs. large values • What about how variable the groups are?
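The "of mean? of median?" question can be made concrete with a short sketch. The values below are invented, with one outlier in the first group, and the function name is an assumption for illustration only.

```python
import statistics

def fold_change(group_a, group_b, center=statistics.mean):
    """Fold change of group_a over group_b on the raw (unlogged) scale."""
    return center(group_a) / center(group_b)

# Made-up expression values for one gene; the 8.0 is a single outlier.
responders = [1.0, 1.2, 0.9, 1.1, 8.0]
non_responders = [1.0, 1.0, 1.1, 0.9, 1.0]

fc_mean = fold_change(responders, non_responders)
fc_median = fold_change(responders, non_responders, center=statistics.median)
print(round(fc_mean, 2), round(fc_median, 2))
```

The mean-based fold change is pulled well above 2 by the one outlier, while the median-based fold change stays near 1. Neither number says anything about how variable the groups are, which is the gap the tests on the following slides address.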

An Example using one gene:

t-test • Test statistic: • p-value = probability of seeing your data or more extreme if there is no difference in the groups
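The formula for the test statistic did not survive on this slide. As an assumption, the standard unequal-variance (Welch) form is shown here, since that is the version the Excel example requests with type 3. The symbols (group means, sample variances, and group sizes n1 and n2) are notation chosen for this sketch.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```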

t-test in Excel • Syntax: TTEST(array1,array2,tails,type) • Example: • first group is in cells c3 – k3 • second group is in cells l3 – v3 • we want a two-sided t-test (no preconceived idea about which group is more highly expressed), so tails = 2 • we assume the variances are unequal, so type = 3 • in cell w3 type: “=ttest(c3:k3,l3:v3,2,3)”
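For readers working outside Excel, here is a minimal sketch of the unequal-variance t statistic using only the Python standard library. It returns the statistic and the approximate (Welch-Satterthwaite) degrees of freedom, not the p-value that TTEST reports, because the standard library has no t distribution. The function name is hypothetical, and empty cells are assumed to have been dropped beforehand.

```python
import math
import statistics

def welch_t(group1, group2):
    """Return (t, df) for a two-sample t-test that does not assume equal variances."""
    n1, n2 = len(group1), len(group2)
    # Squared standard error contributed by each group.
    v1 = statistics.variance(group1) / n1
    v2 = statistics.variance(group2) / n2
    t = (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Made-up values standing in for one gene's two groups.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 2))
```

A two-sided p-value would then come from a t distribution with df degrees of freedom, for example from a statistics package or a printed table.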

Wilcoxon Rank Sum Test • Instead of comparing averages, this test compares rankings (or medians) • In order to discount influential points, we replace the data values with their appropriate rankings. • We compute a z-test (sister of the t-test) on the ranked data.

[Figure panels: up-regulated genes; down-regulated genes]

Technical Details • Replace values with ranks • Sum the ranks in the first group • Calculate hypothesized mean1 = n1*(n1+n2+1)/2, where n1 and n2 are the two group sizes • Calculate hypothesized standard deviation1 = sqrt(n1*n2*(n1+n2+1)/12) • Calculate test statistic = (sum ranks – hyp mean1) / hyp stdev1 • Find the p-value using the normal distribution (probability of a value at least as extreme as the test statistic if there are no differences between the two groups)
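The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: ranks run from 1 for the smallest value, tied values share their average rank, there is no tie or continuity correction, and the p-value is two-sided. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def rank_sum_z(group1, group2):
    """Return (z, two-sided p) from the normal approximation to the rank-sum test."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    rank_sum1 = sum(ranks[:n1])                      # sum of ranks in group 1
    mean1 = n1 * (n1 + n2 + 1) / 2                   # hypothesized mean
    sd1 = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # hypothesized std deviation
    z = (rank_sum1 - mean1) / sd1
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up values: every value in group 1 is below every value in group 2.
z, p = rank_sum_z([1, 2, 3], [4, 5, 6, 7])
print(round(z, 3), round(p, 3))
```

With groups this small the normal approximation is rough; it is used here only because it mirrors the calculation on the slide.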

Wilcoxon Rank Sum in Excel • Using the rank function, translate your data into ranks • Y3: “=RANK(C3,C3:V3)” this finds the rank of C3 in the range C3-V3 (you’ll probably get a “#value” here, that’s OK because C3 is empty for gene = IMAGE:253507) • Repeat this command for Z3 to AR3 keeping the second half of the function always C3:V3 • Copy the row from Y3 to AR3 and paste from Y4 to AR2366 • AS3: “=SUMIF(Y3:AG3,">0",Y3:AG3)” (sum rank grp1) • AT3: “=COUNT(Y3:AG3)*(COUNT(Y3:AR3)+1)/2” (mean1) • AU3: “=SQRT(COUNT(Y3:AG3)*COUNT(AH3:AR3)*(COUNT(Y3:AR3)+1)/12)” (stdev1) • AV3: “=(AS3-AT3)/AU3” (zscore1 = test stat) • AW3: “=2*(1-NORMDIST(ABS(AV3),0,1,TRUE))” (p-value)

SAM (Significance Analysis of Microarrays) • is a statistical technique for finding significant genes in a set of microarray experiments • can be used in a comparison experiment • can also be used with a quantitative response (like tumor size) or with one-class data

Technical Details • For the ith gene, comparing two groups, the test statistic is: • Rank the di and keep as test statistics • Permute the data labels 100 times, and calculate expected values for the di given no structure. • Plot observed di vs. expected di
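The formula for the test statistic is missing from this slide. The sketch below assumes the usual published two-class form, d_i = (mean1_i - mean2_i) / (s_i + s0), where s_i is the pooled standard error for gene i and s0 is a small positive constant that keeps genes with tiny variance from dominating. The s0 value, the sign convention (group 1 minus group 2), and all function names are assumptions for illustration; the SAM software chooses s0 from the data.

```python
import math
import random
import statistics

def sam_d(values, labels, s0=0.1):
    """SAM-style statistic for one gene; labels are 1 or 2 for each sample."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 2]
    n1, n2 = len(g1), len(g2)
    m1, m2 = statistics.mean(g1), statistics.mean(g2)
    # Pooled sum of squares around each group's own mean.
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def observed_and_expected(data, labels, n_perm=100, s0=0.1, seed=0):
    """Return sorted observed d values and the expected order statistics.

    Expected values are the average, over label permutations, of the
    sorted d values computed with shuffled labels (no group structure).
    """
    rng = random.Random(seed)
    observed = sorted(sam_d(row, labels, s0) for row in data)
    totals = [0.0] * len(data)
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        permuted = sorted(sam_d(row, shuffled, s0) for row in data)
        totals = [t + d for t, d in zip(totals, permuted)]
    expected = [t / n_perm for t in totals]
    return observed, expected

# Made-up data: three genes (rows) measured on six samples.
data = [[1, 2, 3, 4, 5, 6], [2, 1, 2, 1, 2, 1], [5, 5, 6, 1, 1, 2]]
labels = [1, 1, 1, 2, 2, 2]
observed, expected = observed_and_expected(data, labels)
```

Plotting observed against expected gives the SAM plot: genes whose observed value falls far from the expected value are the candidates for differential expression.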

False Discovery Rate • We know that the expected di were computed with no group structure. • Any “large” expected di values will be false positives. • If we see 30 observed di above some cutoff and 10 expected di above the same cutoff, we estimate that about 10 of those 30 are false positives, a false discovery rate of roughly 10/30 or 33% (though we can never know *which* genes are the false positives)
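The arithmetic on this slide can be written as a short function. This is a simplification that follows the slide's counting argument with a single upper cutoff; the SAM software's own estimate is more refined. The function name and the lists below are invented for illustration.

```python
def estimated_fdr(observed_d, expected_d, cutoff):
    """Estimate the false discovery rate as (expected count) / (observed count) above cutoff."""
    called = sum(1 for d in observed_d if d > cutoff)
    false_calls = sum(1 for d in expected_d if d > cutoff)
    if called == 0:
        return 0.0  # nothing called significant, so nothing falsely discovered
    return min(1.0, false_calls / called)

# Made-up values arranged to match the slide: 30 observed and 10 expected above the cutoff.
observed = [3.0] * 30 + [0.0] * 70
expected = [3.0] * 10 + [0.0] * 90
print(round(estimated_fdr(observed, expected, cutoff=2.0), 3))
```

Raising the cutoff usually lowers both counts, which is the trade-off the slider in SAM exposes.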

Features of SAM • Slider – we can change the false discovery rate • Fold change – in addition to the false discovery rate, we can require the genes to be at some fold change threshold (on average) • Gene lists – gene lists are given along with corresponding significance levels • Web Link option for more information about particular genes

Imputation • Most microarray data sets have missing values • If background is bigger than foreground, the observed signal will be negative! • Poor quality spots are removed prior to analysis. • SAM needs a complete data set; missing values can be filled in by: • Substitution of the row average • Substitution using k-nearest neighbors
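Both substitution strategies can be sketched briefly. The code assumes each row is one gene, missing values are stored as None, and neighbors are the k rows closest in root-mean-square distance over the columns both rows have observed. These choices and the function names are assumptions for illustration, not a description of what SAM does internally.

```python
import math

def row_average(row):
    """Mean of the observed (non-None) values in one row."""
    present = [v for v in row if v is not None]
    if not present:
        raise ValueError("row has no observed values to average")
    return sum(present) / len(present)

def impute_row_average(data):
    """Replace each missing value with the average of its own row."""
    return [[row_average(row) if v is None else v for v in row] for row in data]

def distance(a, b):
    """Root-mean-square difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def impute_knn(data, k=2):
    """Replace each missing value with the mean of that column in the k nearest rows."""
    filled = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            # Fall back to the row average when no usable neighbor exists.
            filled[i][j] = sum(nearest) / len(nearest) if nearest else row_average(row)
    return filled

# Made-up data: the first gene is missing its third measurement.
data = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [9.0, 9.0, 9.0]]
print(impute_row_average(data)[0], impute_knn(data, k=2)[0])
```

The row average uses only the gene's own values, while the nearest-neighbor version borrows from genes with similar profiles, so the two can give noticeably different fills.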
