Statistical Issues in the Design of Microarray Experiment Jean Yee Hwa Yang University of California, San Francisco http://www.biostat.ucsf.edu/jean/ Stat 246 Lecture 18, Mar 30, 2004
Different Technologies: SAGE; nylon membrane; Illumina Bead Array; Affymetrix GeneChip; cDNA microarray; Agilent (long oligo, ink jet); CGH
Life Cycle
Biological question -> Experimental design -> Microarray experiment -> Image analysis -> Preprocessing (normalization) -> Quality measurement (Failed or Pass) -> Analysis (estimation, testing, clustering, discrimination) -> Biological verification and interpretation
Experimental design
Proper experimental design is needed to ensure that questions of interest can be answered and that this can be done accurately, given experimental constraints such as cost of reagents and availability of mRNA.
Definition of probe and target
TARGET: the labeled samples (cDNA “A”, Cy5 labeled; cDNA “B”, Cy3 labeled)
PROBE: the DNA sequences spotted on the array
Two main aspects of array design
• Design of the array: arrayed library (96- or 384-well plates of bacterial glycerol stocks), spotted as a microarray on glass slides.
• Allocation of mRNA samples to the slides: cDNA “A” (Cy5 labeled) and cDNA “B” (Cy3 labeled), followed by hybridization.
Two main aspects of array design
• Design of the array.
• Allocation of mRNA samples to the slides (figure: MT and WT samples, hybridization).
Some aspects of design
1. Layout of the array
Which sequence to print?
• Affymetrix: selecting the PM and MM’s.
• cDNA library: Riken, NIA etc.
• Selecting oligo probes:
  • Operon (commercial).
  • Agilent (commercial).
  • Illumina (commercial).
  • OligoPicker (Wang & Seed, Harvard)
  • ArrayOligoSelector (Bozdech et al., DeRisi Lab, UCSF)
  • OligoArray 1.0 (Rouillard, Herbert and Zuker)
Controls: normalization and quality checks.
Number: duplicate or replicate spots within a slide; position.
Commonly asked questions
• Should we put duplicates on a slide? [Discussion in Smyth et al. (2003)]
• What should be the percentage of control spots?
• Where should the control spots be placed? [These relate to preprocessing such as quality assessment and normalization.]
External controls (figure from van de Peppel et al., 2003)
References
• Microarray sample pool: Yang et al. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4), e15.
• Titrated external controls: van de Peppel, J., et al. (2003). Monitoring global mRNA changes with externally controlled microarray experiments. EMBO Reports 4: 387-393.
• An example of doping controls for microarray: http://genome-www.stanford.edu/turnover/supplement.shtml
• Commercial: Lucidea Universal ScoreCard by Amersham. http://www1.amershambiosciences.com/
Some aspects of design
2. Allocation of samples to the slides
A. Types of samples
• Replication: technical, biological.
• Pooled vs individual samples.
• Pooled vs amplified samples.
B. Different design layouts
• Scientific aim of the experiment.
• Robustness.
• Extensibility.
• Efficiency.
Taking physical limitations or cost into consideration:
• the number of slides.
• the amount of material.
This relates to both Affymetrix and two-color spotted arrays.
Avoidance of bias
• The conditions of an experiment (mRNA extraction and processing, the reagents, the operators, the scanners and so on) can leave a “global signature” in the resulting expression data.
• Randomization.
• Local control is the general term used for arranging experimental material.
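One simple way to apply randomization is to let a random number generator decide the order in which hybridizations are processed, so that day or operator effects are less likely to line up with the comparison of interest. A minimal sketch, assuming hybridizations are listed as pairs of sample labels; the function name `randomize_run_order`, the `(cy3_sample, cy5_sample)` layout and the sample labels are hypothetical choices for illustration only:

```python
import random


def randomize_run_order(hybridizations, seed=None):
    """Return the hybridizations in a random processing order.

    `hybridizations` is a list of (cy3_sample, cy5_sample) labels; this
    layout is a hypothetical convention chosen for the example.
    """
    rng = random.Random(seed)      # seeded so the plan can be reproduced
    order = list(hybridizations)   # copy, so the input list is left untouched
    rng.shuffle(order)
    return order


# Hypothetical plan: four wild-type vs mutant hybridizations.
planned = [("WT1", "MT1"), ("WT2", "MT2"), ("WT3", "MT3"), ("WT4", "MT4")]
run_order = randomize_run_order(planned, seed=2004)
for position, (cy3, cy5) in enumerate(run_order, start=1):
    print(position, cy3, cy5)
```

Recording the seed alongside the lab notes makes the allocation reproducible without making it predictable in advance.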
Preparing mRNA samples: Mouse model
Dissection of tissue -> RNA isolation -> Amplification -> Probe labelling -> Hybridization
Preparing mRNA samples: Mouse model
Dissection of tissue [biological replicates] -> RNA isolation -> Amplification -> Probe labelling -> Hybridization
Preparing mRNA samples: Mouse model
Dissection of tissue -> RNA isolation [technical replicates] -> Amplification -> Probe labelling -> Hybridization
Technical replication - amplification
• Olfactory bulb experiment:
  • 3 sets of Anterior vs Dorsal performed on different days.
  • #10 and #12 were from the same RNA isolation and amplification.
  • #12 and #18 were from different dissections and amplifications.
  • All 3 data sets were labeled separately before hybridization.
Data provided by Dave Lin (Cornell)
Pooling: looking at very small amounts of tissue
Mouse model: Dissection of tissue -> RNA isolation -> Pooling -> Probe labelling -> Hybridization
Pooled vs Individual samples (figure: Design 1 and Design 2, taken from Kendziorski et al. (2003))
Pooled vs Individual samples
• Pooling is seen as “biological averaging”.
• Trade-off between
  • Cost of performing a hybridization.
  • Cost of the mRNA samples.
• Cost of mRNA samples << cost per hybridization: pooling can help reduce the number of hybridizations.
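The trade-off can be made concrete with a simple variance calculation. The sketch below assumes that the log expression of a pool equals the average over its subjects (the “biological averaging” view), that each array adds independent technical noise, and that every subject enters exactly one pool. The function name and the variance values are made up for illustration and are not taken from the cited papers:

```python
def variance_of_group_mean(sigma_b2, sigma_t2, subjects_per_pool, n_arrays):
    """Variance of the estimated mean log expression for one group.

    sigma_b2: biological (between-subject) variance
    sigma_t2: technical (per-array) variance
    subjects_per_pool: subjects averaged into each pool (1 = no pooling)
    n_arrays: number of arrays, one pool per array
    """
    if subjects_per_pool < 1 or n_arrays < 1:
        raise ValueError("need at least one subject per pool and one array")
    # Pooling averages the biological component; technical noise is per array.
    per_array = sigma_b2 / subjects_per_pool + sigma_t2
    return per_array / n_arrays


# Hypothetical variances: biological 1.0, technical 0.5.
individual_12 = variance_of_group_mean(1.0, 0.5, subjects_per_pool=1, n_arrays=12)
pooled_4 = variance_of_group_mean(1.0, 0.5, subjects_per_pool=3, n_arrays=4)
individual_4 = variance_of_group_mean(1.0, 0.5, subjects_per_pool=1, n_arrays=4)
print(individual_12, pooled_4, individual_4)
```

Under these assumptions, 12 subjects pooled onto 4 arrays give a less precise estimate than 12 individual arrays, but a more precise one than 4 individual arrays, which is the sense in which pooling trades sample cost against hybridization cost.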
Pooled vs amplified samples (figure: Design A and Design B, showing the amplification and pooling steps between original samples and amplified samples)
Pooled vs Amplified samples
• In cases where we do not have enough material from one biological sample to perform one array (chip) hybridization, pooling or amplification is necessary.
• Amplification
  • Introduces more noise.
  • Non-linear amplification (??): different genes amplified at different rates.
  • Able to perform more hybridizations.
• Pooling
  • Fewer replicate hybridizations.
References
• Pooling vs Non-Pooling
  • Han, E.-S., Wu, Y., Bolstad, B., and Speed, T. P. (2003). A study of the effects of pooling on gene expression estimates using high density oligonucleotide array data. Department of Biological Science, University of Tulsa, February 2003.
  • Kendziorski, C. M., Zhang, Y., Lan, H., and Attie, A. D. (2003). The efficiency of mRNA pooling in microarray experiments. Biostatistics 4, 465-477.
  • Peng, X., Wood, C. L., Blalock, E. M., Chen, K. C., Landfield, P. W., and Stromberg, A. J. (2003). Statistical implications of pooling RNA samples for microarray experiments. BMC Bioinformatics 4:26.
Some aspects of design
2. Allocation of samples to the slides
A. Types of samples
• Replication: technical, biological.
• Pooled vs individual samples.
• Pooled vs amplified samples.
B. Different design layouts
• Scientific aim of the experiment.
• Robustness.
• Extensibility.
• Efficiency.
Taking physical limitations or cost into consideration:
• the number of slides.
• the amount of material.
Graphical representation
Vertices: mRNA samples; Edges: hybridization; Direction: dye assignment (figure: an arrow joining a Cy5 sample and a Cy3 sample).
Graphical representation
• The structure of the graph determines which effects can be estimated and the precision of the estimates.
• Two mRNA samples can be compared only if there is a path joining the corresponding two vertices.
• The precision of the estimated contrast then depends on the number of paths joining the two vertices and is inversely related to the length of the paths.
• Direct comparisons within slides yield more precise estimates than indirect ones between slides.
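The path condition is easy to check by treating a design as a graph. A minimal sketch, assuming each hybridization is written as a `(cy3_sample, cy5_sample)` pair; that ordering, the function name `path_length` and the example designs are hypothetical. The search ignores edge direction, since dye assignment does not affect whether two samples are connected:

```python
from collections import deque


def path_length(hybridizations, start, goal):
    """Length of the shortest path joining two samples, or None if none exists."""
    neighbours = {}
    for cy3, cy5 in hybridizations:
        # Connectivity does not depend on dye assignment, so store both ways.
        neighbours.setdefault(cy3, set()).add(cy5)
        neighbours.setdefault(cy5, set()).add(cy3)
    if start not in neighbours or goal not in neighbours:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


reference = [("Ref", "T1"), ("Ref", "T2"), ("Ref", "T3")]
loop = [("T1", "T2"), ("T2", "T3"), ("T3", "T1")]
disconnected = [("A", "B"), ("C", "D")]

print(path_length(reference, "T1", "T2"))    # indirect comparison via Ref
print(path_length(loop, "T1", "T2"))         # direct comparison
print(path_length(disconnected, "A", "C"))   # no path: cannot be compared
```

A shorter path means fewer slides contribute noise to the contrast, which matches the statement that precision is inversely related to path length.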
Common reference design (figure: treatments T1, T2, ..., Tn-1, Tn, each hybridized against Ref)
Experiments for which the common reference design is appropriate:
• Meaningful biological control (C): identify genes that responded differently / similarly across two or more treatments relative to control.
• Large scale comparison: to discover tumor subtypes when you have many different tumor samples.
Advantages:
• Ease of interpretation.
• Extensibility: extend the current study, or compare its results to other array projects.
Experiment for which a number of designs are suitable for use: time series (figure: alternative designs for time points T1, T2, T3 and T4, one of them including a reference Ref)
Experiment for which a number of designs are suitable for use: 4 samples (figure: alternative designs for samples A, B, C and A.B)
Comparing 2 classes of estimates direct vs indirect estimates
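The simplest design question: direct versus indirect comparisons
Two samples (A vs B), e.g. KO vs. WT or mutant vs. WT, with two slides in each case.
• Direct (A vs B on both slides): estimate average(log(A/B)), with variance $\sigma^2/2$.
• Indirect (A vs R and B vs R): estimate log(A/R) - log(B/R), with variance $2\sigma^2$.
Here $\sigma^2$ is the variance of a single log-ratio. These calculations assume independence of replicates: the reality is not so simple.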
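The two variances can be checked by simulation. The sketch below assumes each slide yields a log-ratio with independent Gaussian noise of standard deviation `sigma` and sets the true log-ratios to zero, which does not affect the variances. The function name and the simulation settings are illustrative assumptions:

```python
import random
import statistics


def simulate_variances(sigma=1.0, n_sim=20000, seed=0):
    """Monte Carlo variances of the direct and indirect estimates (2 slides each)."""
    rng = random.Random(seed)
    direct, indirect = [], []
    for _ in range(n_sim):
        # Direct: two slides each measure log(A/B); the estimate is their average.
        ab_1 = rng.gauss(0.0, sigma)
        ab_2 = rng.gauss(0.0, sigma)
        direct.append((ab_1 + ab_2) / 2)
        # Indirect: one slide measures log(A/R), the other log(B/R).
        a_r = rng.gauss(0.0, sigma)
        b_r = rng.gauss(0.0, sigma)
        indirect.append(a_r - b_r)
    return statistics.variance(direct), statistics.variance(indirect)


var_direct, var_indirect = simulate_variances()
print(var_direct, var_indirect, var_indirect / var_direct)
```

With independent slides the simulated ratio should come out close to 4 in variance, that is, a factor of about 2 in standard error.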
In practice: data (figure: wt and mutant samples hybridized under Method A and Method B)
The analyses can be done in two ways:
• Exclude the 2 wt vs. wt arrays (one-sample).
• Include the 2 wt vs. wt arrays (two-sample).
Data provided by Grant Hartzog (UCSC)
Experimental results
• 5 sets of experiments with similar structure.
• Compare (Y axis: SE):
  A) SE for aveMmt
  B) SE for aveMmt - aveMwt
• Theoretical ratio of (A / B) is 1.6.
• Experimental observation is 1.1 to 1.4.
Sample size Data provided by Matt Callow
Caveat
• The advantage of direct over indirect comparisons was first pointed out by Churchill & Kerr, and in general we agree with the conclusion. However, the last MA-plot shows that the difference is not a factor of 2, as theory predicts.
• Why? Possibly because mRNA from the same extractions gives correlated expression levels, and pools of controls or reference material are the norm. In other words, the assumption of independence between log(T/Ref) and log(C/Ref) is not valid.
Preparing mRNA samples: Mouse model
Dissection of tissue -> RNA isolation -> Amplification -> Probe labelling -> Hybridization
Extreme technical replication - labeling
• 3 sets of self-self hybridization (cerebellum vs cerebellum).
• Data 1 and Data 2 were labeled together and hybridized on two slides separately.
• Data 3 was labeled separately.
• Comparing log-ratios between the 3 experiments (figure: pairwise plots of Data 1, Data 2 and Data 3).
Data provided by Elva Diaz (UC Davis)
Pairs plot of log-intensity
Looking at constantly expressed genes only. Samples A and B with technical replicates A’ and B’; a = log2A and b = log2B.
Data provided by Grant Hartzog and Todd Burcin (UCSC)
Samples A and B: a = log2A and b = log2B; a’ = log2A’ and b’ = log2B’ for their technical replicates.
• $\tau^2 = \mathrm{var}(a)$: variance of log signal.
• $\gamma_1 = \mathrm{cov}(a, b)$: covariance between measurements on samples on the same slide.
• $\gamma_2 = \mathrm{cov}(a, a')$: covariance between measurements on technical replicates from different slides.
• $\gamma_3 = \mathrm{cov}(a, b')$: covariance between measurements on samples which are not technical replicates and not on the same slide.
Implication for design (c and d denote measurements on further samples C and D on other slides)
• $\sigma^2 = \mathrm{var}(a - b) = 2(\tau^2 - \gamma_1)$
• $c_0 = \mathrm{cov}(a - b, c - d) = 0$
• $c_1 = \mathrm{cov}(a - b, a' - c) = \gamma_2 - \gamma_3$
• $c_2 = \mathrm{cov}(a - b, a' - b') = 2(\gamma_2 - \gamma_3) = 2c_1$
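These identities follow from expanding each covariance term by term. A short derivation, assuming every measurement has the same variance $\tau^2 = \mathrm{var}(a)$ and that covariances depend only on whether two measurements share a slide ($\gamma_1$), are technical replicates on different slides ($\gamma_2$), or neither ($\gamma_3$). The second difference in $c_1$ is taken to involve a different sample $c$, an assumption that is consistent with the stated result $\gamma_2 - \gamma_3$:

```latex
\begin{align*}
\sigma^2 &= \operatorname{var}(a-b)
          = \operatorname{var}(a) + \operatorname{var}(b) - 2\operatorname{cov}(a,b)
          = 2(\tau^2 - \gamma_1), \\
c_0 &= \operatorname{cov}(a-b,\; c-d)
     = \gamma_3 - \gamma_3 - \gamma_3 + \gamma_3 = 0, \\
c_1 &= \operatorname{cov}(a-b,\; a'-c)
     = \gamma_2 - \gamma_3 - \gamma_3 + \gamma_3 = \gamma_2 - \gamma_3, \\
c_2 &= \operatorname{cov}(a-b,\; a'-b')
     = \gamma_2 - \gamma_3 - \gamma_3 + \gamma_2 = 2(\gamma_2 - \gamma_3) = 2c_1 .
\end{align*}
```

In each line the four terms are, in order, the covariances of the first-with-first, first-with-second, second-with-first and second-with-second measurements of the two differences.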
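Direct vs Indirect - revisited
Two samples (A vs B), e.g. KO vs. WT or mutant vs. WT.
• Indirect (A vs R and B vs R): $y = (a - r) - (b - r')$, $\mathrm{Var}(y) = 2\sigma^2 - 2c_1$.
• Direct (A vs B on two slides): $y = (a - b) + (a' - b')$, $\mathrm{Var}(y/2) = \sigma^2/2 + c_1$.
Here $\sigma^2 = \mathrm{var}(a - b)$ and $c_1$ is the covariance defined on the previous slide.
• If $\sigma^2 = 2c_1$: efficiency ratio (Indirect / Direct) = 1.
• If $c_1 = 0$: efficiency ratio (Indirect / Direct) = 4.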
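The two limiting cases can be reproduced directly from the variance formulas on this slide. A minimal sketch, where `sigma2` stands for var(a - b) and `c1` for the covariance between log-ratios that share a technically replicated sample; the function names and the numeric inputs are illustrative:

```python
def indirect_variance(sigma2, c1):
    """Var of (a - r) - (b - r'): two slides sharing a replicated reference."""
    return 2 * sigma2 - 2 * c1


def direct_variance(sigma2, c1):
    """Var of the average of (a - b) and (a' - b'): two replicate direct slides."""
    return sigma2 / 2 + c1


def efficiency_ratio(sigma2, c1):
    """Variance of the indirect estimate divided by that of the direct estimate."""
    return indirect_variance(sigma2, c1) / direct_variance(sigma2, c1)


print(efficiency_ratio(1.0, 0.0))   # c1 = 0: independent replicates
print(efficiency_ratio(2.0, 1.0))   # sigma2 = 2 * c1: highly correlated replicates
print(efficiency_ratio(1.0, 0.25))  # an intermediate, made-up case
```

As `c1` grows from 0 towards `sigma2 / 2`, the ratio falls from 4 to 1, which is why correlated reference material narrows the gap between the two designs.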
Summary • Create highly correlated reference samples to overcome inefficiency in common reference design. • Not advocating the use of technical replicates in place of biological replicates for samples of interest.
Gene expression data
Gene expression data on G genes for n hybridizations, as a gene-by-array data matrix (rows: genes; columns: mRNA samples):
Gene  sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90  ...
2       -0.10     0.49     0.24     0.06     0.46  ...
3        0.15     0.74     0.04     0.10     0.20  ...
4       -0.45    -1.03    -0.79    -0.56    -0.32  ...
5       -0.06     1.06     1.35     1.09    -1.09  ...
Entry (i, j) is the gene expression level of gene i in mRNA sample j: M = log(Red intensity / Green intensity), or a function of (PM, MM) using MAS, dChip or RMA.
Combining data across arrays
How can we design experiments and combine data across slides to provide accurate estimates of the effects of interest? Linear models and extensions thereof can be used to effectively combine data across arrays for complex experimental designs.
(figure: an experimental design on samples A, B, AB and C, leading to regression analysis)
Comparing K treatments
(i) Indirect design: A, B and C each hybridized against a reference R.
(ii) Direct design: A, B and C hybridized against each other.
Question: Which design gives the most precise estimates of the contrasts A-B, A-C, and B-C?
Common reference design (samples A, B and C, each hybridized against R)
Log ratios: y
Model: E(y1) = A - R, E(y2) = B - R, E(y3) = C - R
Question 1: Estimate the log-ratio between samples A and B (the parameter is A - B). Calculate y1 - y2 for every gene.
Question 2: Which genes are differentially expressed in at least one of the samples relative to R? Calculate the F-statistic for every gene.
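Both questions can be sketched for a single gene. The example below assumes each comparison against R is replicated on several slides (the slide itself does not say how many) and uses made-up log-ratios. The F-statistic shown tests the hypothesis that all three expected log-ratios are zero, with the pooled within-group variance as the error estimate; the function name is hypothetical:

```python
from statistics import mean


def reference_design_summary(y1, y2, y3):
    """Return (estimate of A - B, F-statistic for 'any difference from R')."""
    groups = [y1, y2, y3]
    means = [mean(g) for g in groups]
    contrast_ab = means[0] - means[1]            # Question 1: y1 - y2

    k = len(groups)
    n_total = sum(len(g) for g in groups)
    df_resid = n_total - k
    # Sum of squares explained by the three means (H0: every mean is zero).
    ss_model = sum(len(g) * m ** 2 for g, m in zip(groups, means))
    # Residual sum of squares around each group mean.
    ss_resid = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    if df_resid <= 0 or ss_resid == 0:
        raise ValueError("replicate slides with some variation are required")
    f_stat = (ss_model / k) / (ss_resid / df_resid)  # Question 2
    return contrast_ab, f_stat


# Made-up replicate log-ratios for one gene: A vs R, B vs R, C vs R.
contrast, f_stat = reference_design_summary([1.0, 1.2], [0.4, 0.6], [0.1, -0.1])
print(contrast, f_stat)
```

Applied row by row to the gene-by-array matrix, this gives one contrast and one F-statistic per gene; under the null hypothesis and normal errors the statistic would be compared with an F distribution on 3 and N - 3 degrees of freedom.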