Statistical Design and Analysis of Microarray Experiments Peng Liu 6/15/2010
Microarray Technology • Microarray technology allows measuring expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously. • Two types of platforms: • Affymetrix (single-color) • Two-color microarray
Wild-type vs. Myostatin Knockout Mice. Belgian Blue cattle have a mutation in the myostatin gene. Design of Affymetrix experiment: one sample per chip
Designing 2-color microarray (3 layers). From Churchill, 2002, Nature Genetics
Example I: Sawers et al, 2007, BMC Bioinformatics [Figure: panels labeled M, B, and V; bundle sheath strands; mesophyll protoplasts]
Example I: Sawers et al, 2007, BMC Bioinformatics • The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types. • Goal: To detect genes that are differentially expressed in Bundle Sheath (B) and Mesophyll (M) cells.
Example I: Sawers et al, 2007, BMC Bioinformatics • A simple method: Isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).
Example I: Sawers et al, 2007, BMC Bioinformatics • A complication: The procedures for extracting mRNA from the two cell types are different. The one used to extract mRNA from M cells introduces stress. • Solution: Add two more treatment groups: samples containing both M and B cells, with mRNA extracted with and without stress. • B, M, Stress and Total (4 treatment groups)
Direct comparison vs indirect comparison • Direct: comparison within slide • Indirect: comparison between slides • Suppose we want to compare gene expression levels between treatment 1 and treatment 2. [Figure: design diagrams for a direct comparison of treatments 1 and 2 and an indirect comparison through a reference sample R]
Comments about 2-color Microarray Designs • A unique and powerful feature of 2-color microarrays is the ability to make a direct comparison between two samples on the same slide. • By pairing samples, the variation due to slide can be accounted for. • When possible, it is more efficient to use direct comparison. • However, it is sometimes not practical to make direct comparisons of all possible pairs.
Efficiency of comparison • The efficiency of comparisons between 2 samples is determined by the length and the number of paths connecting them. [Figure: design diagrams for a direct comparison (dye-swap) of treatments 1 and 2 and an indirect comparison through a reference sample R]
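A worked variance comparison may make the efficiency claim concrete. This is only a sketch under assumptions that the slides do not state: two slides are used for each design, slide effects cancel within a slide, dye effects are ignored, and every single-channel log intensity carries an independent error with common variance \sigma^2. Here y_{tk} denotes the log intensity of treatment t on slide k and r_k the reference sample on slide k (notation introduced for this illustration).

```latex
\begin{align*}
\text{Direct (dye-swap):}\quad
\hat{\Delta}_{\text{direct}}
  &= \tfrac{1}{2}\big[(y_{11} - y_{21}) + (y_{12} - y_{22})\big],
  &\operatorname{Var}\big(\hat{\Delta}_{\text{direct}}\big)
  &= \tfrac{1}{4}\,(2\sigma^2 + 2\sigma^2) = \sigma^2, \\[4pt]
\text{Indirect (via } R\text{):}\quad
\hat{\Delta}_{\text{indirect}}
  &= (y_{11} - r_{1}) - (y_{22} - r_{2}),
  &\operatorname{Var}\big(\hat{\Delta}_{\text{indirect}}\big)
  &= 4\sigma^2 .
\end{align*}
```

Under these assumptions the direct comparison has one quarter of the variance of the indirect one for the same number of slides: the direct design has two paths of length one between the treatments, while the reference design has a single path of length two.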
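Reference vs Loop design [Figure: reference design, with treatments 1, 2, and 3 each compared with a reference R; loop design, with treatments 1, 2, and 3 compared in a loop]
Designing experiment for example I • With 6 biological replicates [Figure: design diagram connecting B, Total, Stress, and M]
After the bench work… [Figures: Affymetrix GeneChip image; 2-color microarray image]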
Pre-normalization analysis • Image processing: obtain the intensity measurement of the signal • Background correction: remove local background that might be due to non-specific binding and obtain the target sample intensity • Filtration: remove unreliable spots and reduce the dimension of the data • Transformation: convert the data into a format that makes data analysis valid or easier
Normalization • Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected. • Aim: remove sources of systematic variation • Example of non-biological variation: dye difference for 2-color microarray
Self-self experiment. Figure from Dudoit et al, 2002, Statistica Sinica
Normalization: M vs. A Plot (45° rotation) • M = Log Red − Log Green • A = (Log Green + Log Red)/2
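The M and A coordinates can be computed per spot as in the minimal sketch below. The slide only says "Log"; base 2 is assumed here, and the function name `ma_values` and the example intensities are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    Assumption: logarithms are base 2; the slides do not state the base.
    """
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # M: log ratio (Log Red - Log Green)
    a = (log_g + log_r) / 2    # A: average log intensity
    return m, a

# Hypothetical (red, green) intensities for three spots.
spots = [(1024.0, 256.0), (512.0, 512.0), (64.0, 256.0)]
ma = [ma_values(r, g) for r, g in spots]
for (r, g), (m, a) in zip(spots, ma):
    print(f"R={r:7.1f} G={g:7.1f} -> M={m:+.1f} A={a:.1f}")
```

A spot with equal red and green signal gets M = 0 regardless of how bright it is, which is why systematic departures of M from 0 across the A range point to dye bias.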
LOWESS Fit [Plot: M = Log Red − Log Green against A = (Log Green + Log Red)/2, with the LOWESS curve]
After normalization [Plot: normalized M against A]
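The idea behind the LOWESS step can be sketched as follows: fit a smooth curve of M against A, then subtract the fitted value from each M. The code is a simplified stand-in, not the procedure used for the slides: it does a locally weighted linear fit with tricube weights and omits the robustness iterations that full LOWESS implementations usually add. The names `lowess_fit` and `normalize_m` and the `frac` default are assumptions of this sketch.

```python
def lowess_fit(a_vals, m_vals, frac=0.5):
    """Locally weighted linear fit of M on A (tricube weights, no robustness step)."""
    n = len(a_vals)
    k = max(2, int(round(frac * n)))  # number of neighbours used in each local fit
    fitted = []
    for x0 in a_vals:
        dists = sorted(abs(x - x0) for x in a_vals)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a_vals]
        sw = sum(w)
        xbar = sum(wi * x for wi, x in zip(w, a_vals)) / sw
        ybar = sum(wi * y for wi, y in zip(w, m_vals)) / sw
        sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, a_vals))
        sxy = sum(wi * (x - xbar) * (y - ybar)
                  for wi, x, y in zip(w, a_vals, m_vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted

def normalize_m(a_vals, m_vals, frac=0.5):
    """Normalized M: observed M minus the fitted intensity-dependent trend."""
    trend = lowess_fit(a_vals, m_vals, frac)
    return [m - t for m, t in zip(m_vals, trend)]

# Hypothetical data with a purely intensity-dependent dye bias, M = 0.5*A - 3.
a_demo = [float(a) for a in range(6, 16)]
m_demo = [0.5 * a - 3.0 for a in a_demo]
m_norm = normalize_m(a_demo, m_demo)
```

Because the demo bias is exactly linear in A, the local linear fit recovers it and the normalized M values are zero up to rounding. Real data would leave the biological signal as scatter around zero.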
Statistical Inference • Data notation for normalized signal intensities (NSI): Yijk for each gene (g) • i: treatment index • j: dye index • k: slide index • e.g., Y224 is the NSI for treatment 2, dye 2, slide 4, and Y114 is the NSI for treatment 1, dye 1, slide 4
Fitting linear models to microarray data • After the normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array). • Together, the data is an array with each row for one gene and each column for one channel or one chip. • We will fit a statistical model for each gene separately.
Mean expressions for 4 treatment groups • Treatment means: • M (M cells with stress): μ + v2 + (stress effect) • B (B cells without stress): μ + v1 • TO (both cell types without stress): μ + c*v2 + (1-c)*v1 • ST (both cell types with stress): μ + c*v2 + (1-c)*v1 + (stress effect) • Note that c is the proportion of M cells in the total leaf sample containing both cell types. • We are interested in testing H0: v1 = v2, that is, whether a given gene is differentially expressed between M and B cells.
Fixed effects • The parameters on the previous slide (v1, v2, and the stress effect) specify fixed effects. • Fixed effects are used to specify the mean of the response variable. • A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another. • The fixed effects included in the model depend on the experimental design.
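One consequence of these four means is that the stress term can be removed by a contrast: (M − B) − (ST − TO) equals v2 − v1 whatever the values of c and the stress effect. The sketch below checks this numerically. The parameter name `stress` and all numeric values are hypothetical, chosen only for illustration.

```python
def treatment_means(mu, v1, v2, stress, c):
    """Mean expression of one gene in each of the four treatment groups."""
    return {
        "M": mu + v2 + stress,                       # M cells, with stress
        "B": mu + v1,                                # B cells, without stress
        "TO": mu + c * v2 + (1 - c) * v1,            # both cell types, without stress
        "ST": mu + c * v2 + (1 - c) * v1 + stress,   # both cell types, with stress
    }

# Hypothetical parameter values for a single gene.
means = treatment_means(mu=8.0, v1=1.0, v2=2.5, stress=0.75, c=0.25)

# ST - TO isolates the stress effect; subtracting it from M - B leaves v2 - v1.
stress_estimate = means["ST"] - means["TO"]
contrast = (means["M"] - means["B"]) - stress_estimate
print(contrast)
```

This is why the two extra treatment groups allow the M vs. B comparison to be separated from the extraction stress.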
Random effects • There are some random effects that are unknown: • slide effects • other effects introduced in the experiment (such as biological replicate effects) • residual random effects that include any sources of variation unaccounted for by other terms
Random effects • Random factors are used to specify the correlation structure among the response variable observations. • e.g., observations on the same slide are more correlated than observations from different slides. • The random effects included in the model also depend on the experimental design. • A model that has both fixed and random effects is called a mixed model.
Detecting differentially expressed genes • Construct statistical tests for the parameters of interest, e.g., what is the difference in gene expression (v1 − v2)? v1 − v2 ≠ 0 means differential expression. • Model the random effects and perform tests or construct confidence intervals. • Perform a test for each gene and obtain a p-value. • An empirical Bayes test that borrows information across genes is often used because of its higher power.
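As a much simpler stand-in for the mixed-model and empirical Bayes tests described here, the sketch below computes an ordinary paired t-statistic for each gene from within-slide log-intensity differences. It assumes the differences are already oriented as treatment 1 minus treatment 2 and ignores dye and other effects. The gene names and values are invented for illustration.

```python
import math
import statistics

def paired_t_statistic(diffs):
    """t-statistic for H0: mean within-slide difference is 0.

    Requires at least two differences with non-zero sample standard deviation.
    """
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err

# Hypothetical within-slide differences for two genes on four slides.
gene_diffs = {
    "gene_a": [1.0, 1.2, 0.8, 1.0],
    "gene_b": [0.1, -0.2, 0.05, -0.1],
}
t_stats = {gene: paired_t_statistic(d) for gene, d in gene_diffs.items()}
for gene, t in t_stats.items():
    print(f"{gene}: t = {t:.2f} on {len(gene_diffs[gene]) - 1} df")
```

With so few slides per gene, the per-gene variance estimate is unstable. That instability is the motivation for borrowing information across genes.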
[Figure: histogram of p-values] 2536 p-values below 0.05. We would expect around 0.05*40000 = 2000 p-values to be less than 0.05 by chance if no genes were differentially expressed.
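Possible Errors in Testing ONE gene • Type I Error: false positive • Type II Error: false negative (probability = 1 − power) • Power: probability of a true positive
Error Rate in Multiple Testing • Outcomes when testing m genes (Benjamini and Hochberg, 1995): V = number of false positives (true null hypotheses that are rejected), R = total number of rejected null hypotheses • Family-wise error rate, FWER = Pr(V > 0) • False Discovery Rate, FDR = E(V/R | R > 0) * Pr(R > 0)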
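One common way to control the FDR is the step-up procedure of Benjamini and Hochberg (1995): sort the m p-values, find the largest rank k with p(k) ≤ k*q/m, and reject the k hypotheses with the smallest p-values. The slides do not say which procedure was applied, so the sketch below is illustrative only. The function name and the example p-values are made up.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return one boolean per p-value: True where H0 is rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
flags = benjamini_hochberg(pvals, q=0.05)
print(sum(flags), "of", len(pvals), "genes declared differentially expressed")
```

In this example five p-values fall below 0.05, but only the two smallest pass their rank-specific thresholds (0.005 and 0.01).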
Clustering • Grouping genes into different “clusters” based on their expression profiles
Other analyses • Relating gene expression to biological functional categories: gene enrichment test • Connecting microarray data with other kinds of data, such as survival data • More …
Assigned References • Nettleton, D. (2006) A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.