Design of microarray gene expression profiling experiments. Peter-Bram ’ t Hoen. Lay-out. Practical considerations Pooling Randomization One-color vs Two-colors Two-color hybridization designs Ratio-based vs Intensity-based analysis. Think before you start. research question
Design of microarray gene expression profiling experiments Peter-Bram ’t Hoen
Lay-out • Practical considerations • Pooling • Randomization • One-color vs Two-colors • Two-color hybridization designs • Ratio-based vs Intensity-based analysis
Think before you start • research question • choice of technology • controls and replicates Ref: Churchill. 2002. Nature Genetics Supplement 32: 490-495
Research question • Limit your (initial) number of questions / conditions • Choose the best timepoint for mRNA regulation (can be different from protein/activity; run pilots using RT-qPCR) • Experimental follow-up: what will you do with the data? (verification of differential gene expression; in vitro experiments to study mechanism; "in vivo" verification in tissue sections)
Choice of technology • What is affordable? • Do a pilot to estimate the variance for your samples, experimental set-up and platform • Calculate your power: what is the smallest effect size that you can detect?
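The power question on this slide can be sketched numerically. The snippet below is a minimal illustration, assuming a two-group comparison of log2 expression values for a single gene, a standard deviation taken from the pilot, and a normal approximation; the function names and default values are hypothetical, and it ignores multiple testing across genes, so real thresholds will be stricter.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_detectable_effect(sd, n_per_group, alpha=0.05, power=0.8):
    """Smallest true difference in mean log2 expression that a two-sided,
    two-group comparison detects with the requested power.
    Normal approximation; sd is the per-sample standard deviation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sd * sqrt(2 / n_per_group)


def replicates_needed(sd, effect, alpha=0.05, power=0.8):
    """Replicates per group needed to detect `effect` (same assumptions)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum * sd / effect) ** 2)


# Hypothetical pilot result: sd of 0.5 on the log2 scale, 4 replicates per group.
print(min_detectable_effect(sd=0.5, n_per_group=4))
print(replicates_needed(sd=0.5, effect=1.0))
```

With a pilot-based estimate of the standard deviation, the first function gives the lower limit of the effect size for a fixed budget, and the second turns a target effect size into a replicate count.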
Controls • positive: genes whose regulation is known • check on biological experiment & data analysis • positive: spikes in mRNA and/or hyb mix • check labeling procedure and hybridization • detection range (sensitivity) and dynamic range • "landing lights" for gridding software • negative controls: non-specific binding • check cross-hybridization: buffer, non-homologous DNA
[Figure: spike RNAs are added to the reference RNA and the test RNA before cDNA probe synthesis and hybridization to an array containing the corresponding DNA controls. Spikes: RCA, Cab, rbcL, LTP4, LTP6, XCP2, RPC1, NAC1, TIM, PRK. Spiked amounts (copies/cell) for 2-fold changes: 2 vs 1, 100 vs 50, 10 vs 5, 60 vs 30, 300 vs 150; for 3-fold changes: 3 vs 1, 150 vs 50, 15 vs 5, 60 vs 20, 300 vs 100]
Spikes Van de Peppel et al. EMBO Reports 4, 387 (2003)
Replicates • Include sufficient replicates, based on pilot experiment • Biological replicates are preferred over technical replicates • Control experimental variables with possible unintended effects • genetic background • gender • age
Randomization • Randomize samples with respect to experimental influences • experimenter • day of hybridization • batch of arrays • dye • etc
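A randomization of this kind can be sketched in a few lines. The example below is only an illustration: the group names, sample names and hybridization days are hypothetical, and the same idea applies to experimenter, array batch or dye. It shuffles samples within each group and deals them out over the days, so that no group ends up confounded with a single day.

```python
import random


def randomize_within_groups(groups, days, seed=None):
    """Assign each sample to a hybridization day.

    Samples are shuffled within their group and then dealt out over the
    days in turn, so every group is spread as evenly as possible."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    schedule = {}
    for samples in groups.values():
        order = list(samples)
        rng.shuffle(order)
        offset = rng.randrange(len(days))  # random starting day per group
        for i, sample in enumerate(order):
            schedule[sample] = days[(i + offset) % len(days)]
    return schedule


# Hypothetical experiment: 4 wild-type and 4 knock-out samples, 2 days.
groups = {
    "wt": ["wt1", "wt2", "wt3", "wt4"],
    "ko": ["ko1", "ko2", "ko3", "ko4"],
}
for sample, day in sorted(randomize_within_groups(groups, ["day1", "day2"], seed=42).items()):
    print(sample, day)
```

Recording the seed alongside the lab notes keeps the assignment traceable.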
Pooling • Often done because of lack of sufficient amounts of RNA, but good amplification protocols are available • Advantages: • dampening of individual variation, may increase statistical power • Generally not recommended: • outliers in the population may result in large and significant effects • information on the differences in the population, which is probably biologically relevant, is lost • in fact, it is an artificial way to increase the significance of your findings
Hybridization design • One color: not many difficulties expected • Two color: what to hybridize with what in which color? • Reference design • Paired design • Loop design • Mixed design Read: Yang & Speed (2002). Design issues for cDNA microarray experiments. Nature Reviews Genetics 3, 579-588
Hybridization design: general issues • Comparisons on the same array are more precise than comparisons on different arrays • Identify most important comparisons • Hybridize those on the same slide • Dye swap • A dye-effect is always there • Balance designs with respect to dye (exception: some common reference designs)
Common reference vs direct hybridizations • Direct (A and B hybridized together on two slides): if Var[log(A/B)] for one slide is s², then the variance of the average of the two measurements is s²/2 • Common reference (A vs R on one slide, B vs R on the other): log(A/B) = log(A/R) − log(B/R), so Var[log(A/B)] = Var[log(A/R)] + Var[log(B/R)] = s² + s² = 2s²
More samples (A, B, C; 6 arrays) • Loop (each pair A-B, B-C, C-A hybridized twice): log(A/B) = 2/3 log(A/B) [direct] + 1/3 {log(A/C) − log(B/C)} [indirect]. Assuming that all variances are equal: Var[log(A/B)] = 4/9 (s²/2) + 1/9 (s²) = 1/3 s² • Reference (each of A, B, C hybridized twice against R): Var[log(A/B)] = Var[log(A/C)] = Var[log(B/C)] = 0.5 s² + 0.5 s² = s²
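The 2/3 and 1/3 weights in the loop design follow from inverse-variance weighting of the direct and the indirect estimate. The sketch below reproduces that arithmetic with exact fractions, assuming the two estimates are independent (they come from different arrays) and that each pair of samples is hybridized on two arrays; the function name is hypothetical and variances are expressed in units of s².

```python
from fractions import Fraction


def combine(var_direct, var_indirect):
    """Inverse-variance weighting of two independent estimates.

    Returns (weight_direct, weight_indirect, variance_of_combination)."""
    w_direct = (1 / var_direct) / (1 / var_direct + 1 / var_indirect)
    w_indirect = 1 - w_direct
    variance = w_direct ** 2 * var_direct + w_indirect ** 2 * var_indirect
    return w_direct, w_indirect, variance


s2 = Fraction(1)              # per-array variance, used as the unit
direct = s2 / 2               # average of the two A-vs-B arrays
indirect = s2 / 2 + s2 / 2    # log(A/C) - log(B/C), two arrays each

print(combine(direct, indirect))
```

The combined variance of 1/3 s² for the loop compares with s² for the reference design using the same six arrays.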
Common reference vs direct hybridizations Theoretical considerations • A design is optimal when it minimizes the variance of the effect of interest • Look for designs leading to small variance of log(A/B) Practical considerations • Common reference may be desired when the experiment will be extended in the future or when many different conditions have to be compared • Choose a biologically relevant common reference (say: your control sample). In that case, your ratios are of direct interest and easier to interpret
Time-course designs Take 4 time points T1 T2 T3 T4 The best choice of design depends on the comparisons of interest and on the number of slides available
Time-course designs • Using 3 slides: T1 vs T2, T1 vs T3, T1 vs T4 (each later time point hybridized directly against T1), which is the best design to estimate changes relative to the initial time point: T2 / T1, T3 / T1, T4 / T1
Time-course designs • Using 3 slides: T1 vs T2, T2 vs T3, T3 vs T4 (successive time points hybridized against each other), which is the best design to estimate relative changes between successive time points: T2 / T1, T3 / T2, T4 / T3
Time-course designs • Using 4 slides: T1, T2, T3 and T4 each hybridized against R, which is the reference design. All comparisons have equal precision
Time-course designs • Using 4 slides: T1 vs T2, T2 vs T3, T3 vs T4, T4 vs T1, which is the loop design, balanced with respect to dye. Distant comparisons have lower precision
Time-course designs • Using 4 slides: T1 vs T2, T1 vs T3, T2 vs T4, T3 vs T4, which also uses exactly 2 hybridizations per treatment, balanced with respect to dye. Most precise estimates: 1/2, 1/3, 2/4, 3/4
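The precision statements for the 4-slide designs can be checked with a small least-squares calculation. The sketch below is an illustration under simplifying assumptions, not the method of the cited papers: every array contributes one log-ratio with the same variance s², arrays are independent, and dye effects are ignored. The function name and the way designs are written as lists of sample pairs are hypothetical.

```python
from fractions import Fraction


def contrast_variance(arrays, a, b):
    """Variance of the least-squares estimate of log(a/b), in units of s^2.

    arrays: list of (sample, sample) pairs, one pair per two-color array.
    Sample b is used as baseline; the other samples are the parameters."""
    samples = sorted({s for pair in arrays for s in pair})
    cols = [s for s in samples if s != b]
    idx = {s: i for i, s in enumerate(cols)}
    n = len(cols)

    # Normal-equations matrix X'X, accumulated one array at a time.
    xtx = [[Fraction(0)] * n for _ in range(n)]
    for p, q in arrays:
        row = [Fraction(0)] * n
        if p != b:
            row[idx[p]] = Fraction(1)
        if q != b:
            row[idx[q]] = Fraction(-1)
        for i in range(n):
            for j in range(n):
                xtx[i][j] += row[i] * row[j]

    # Solve (X'X) z = e_a by Gauss-Jordan elimination; z[a] is the variance.
    # A disconnected design has no pivot and raises StopIteration.
    aug = [xtx[i] + [Fraction(int(i == idx[a]))] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return aug[idx[a]][n]


reference = [("T1", "R"), ("T2", "R"), ("T3", "R"), ("T4", "R")]
loop = [("T1", "T2"), ("T2", "T3"), ("T3", "T4"), ("T4", "T1")]
alternative = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")]

for name, design in [("reference", reference), ("loop", loop), ("alternative", alternative)]:
    print(name, contrast_variance(design, "T2", "T1"), contrast_variance(design, "T3", "T1"))
```

Under these assumptions every comparison in the reference design has the same variance, while the loop and the alternative design estimate directly hybridized pairs more precisely than the remaining pairs.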
Factorial designs • Designs for studies which involve factors as explanatory variables • Age group • gender • Cell line • Tumor types
Factorial designs Glonek & Solomon (2004) • Admissible design: using the same number of arrays, no other design yields smaller variances for all parameters Glonek et al. Biostatistics 5, 89-111 (2004)
Factorial design; example • Time • 0h • 24h • Cell lines • I (non-leukaemic) • II (leukaemic) • Find genes diff. expressed at 24 but not at 0: interaction between time and cell line
Factorial design; possible samples • All combinations of factor levels. In this case, 4 are possible: (I, 0h), (I, 24h), (II, 0h), (II, 24h)
Factorial design: analysis model • (log-)linear model is used • experimental conditions correspond to parameter combinations as in:
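The table from the original slide did not survive extraction. One common way to write such a model for a 2 x 2 factorial is sketched below; the symbols (mu for the baseline, alpha for the time effect, beta for the cell-line effect, (alpha beta) for the interaction) are assumptions chosen for illustration and may differ from the notation used on the slide.

```latex
\begin{aligned}
\text{(I, 0h)}:   &\quad \mu \\
\text{(I, 24h)}:  &\quad \mu + \alpha \\
\text{(II, 0h)}:  &\quad \mu + \beta \\
\text{(II, 24h)}: &\quad \mu + \alpha + \beta + (\alpha\beta)
\end{aligned}
```

In this parameterization the interaction is the difference between the 24h vs 0h change in cell line II and the same change in cell line I, which is the effect of interest in the example.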
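Factorial design; possible arrays [Diagram: the four samples (I,0), (I,24), (II,0) and (II,24) as nodes, with the six possible pairwise hybridizations between them labelled (1) to (6)]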
Optimal admissible design • Designs that are not worse than others, and for which the variance of the parameter of interest is (one of the) smallest • In the example: wish to find admissible designs for which the interaction term has one of the smallest variances
Optimal admissible design Glonek et al. Biostatistics 5, 89-111 (2004)
Factorial designs: conclusions • Design with all pairwise comparisons is not the best in this case • Best design can only be found with respect to a model • if model does not fit the data well, design choice may not be the best • make sure model chosen is adequate
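The first conclusion can be illustrated numerically. The sketch below assumes that NumPy is available, that each array yields one log-ratio with equal variance s², and that dye effects are ignored. It compares the design with all six pairwise hybridizations against a hypothetical six-array alternative that doubles the two within-cell-line (time) comparisons. The alternative is chosen for illustration only and is not claimed to be the optimal design from the cited paper.

```python
import numpy as np


def contrast_variance(arrays, contrast):
    """Variance (in units of s^2) of the least-squares estimate of a
    contrast between sample means, for a given set of two-color arrays.

    arrays:   list of (sample, sample) pairs, one per array
    contrast: dict sample -> coefficient; coefficients must sum to zero"""
    samples = sorted({s for pair in arrays for s in pair})
    idx = {s: i for i, s in enumerate(samples)}
    X = np.zeros((len(arrays), len(samples)))
    for r, (p, q) in enumerate(arrays):
        X[r, idx[p]] = 1.0   # each array measures log(p) - log(q)
        X[r, idx[q]] = -1.0
    c = np.zeros(len(samples))
    for s, w in contrast.items():
        c[idx[s]] = w
    # The pseudo-inverse handles the rank deficiency of X'X
    # (only differences between samples are estimable).
    return float(c @ np.linalg.pinv(X.T @ X) @ c)


# Interaction: (II,24 - II,0) - (I,24 - I,0)
interaction = {"II,24": 1, "II,0": -1, "I,24": -1, "I,0": 1}

all_pairs = [("I,0", "I,24"), ("II,0", "II,24"), ("I,0", "II,0"),
             ("I,24", "II,24"), ("I,0", "II,24"), ("I,24", "II,0")]
doubled_time = [("I,0", "I,24"), ("I,24", "I,0"),
                ("II,0", "II,24"), ("II,24", "II,0"),
                ("I,0", "II,0"), ("I,24", "II,24")]

print(contrast_variance(all_pairs, interaction))
print(contrast_variance(doubled_time, interaction))
```

Both designs use six arrays, yet under this model the second gives a smaller variance for the interaction term, which is in line with the point that the best design depends on the parameter of interest and on the model.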
How to compare many different conditions efficiently? • Common reference: not efficient • Loop and mixed designs: not all comparisons have equal precision Ref: Churchill. 2002. Nature Genetics Supplement 32: 490-495
Possible solution • Randomized design • Intensity-based rather than ratio-based calculations • Requires: • Independent hybridization of the two samples on an array; no competition for binding sites • Absence of large spot and array effects • To be tested for each platform
Our favourite platform • Spotted collection of 65-mer oligonucleotides (Sigma-Compugen collection) • 22K
Design used to demonstrate independent hyb ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
Distribution of signal intensities is similar ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
Correlation of intensities is high [Figure legend: R > 0.95; 0.90 < R < 0.95; R < 0.90] ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
Effect of addition of unlabelled target [Figure panels: two targets on microarray; single target on microarray] ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
Correlation of ratios calculated from different hyb designs ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
Intensity-based analysis • Hybridizations of two targets on the array are independent • No saturation and no competition • Intensity readings show high inter-array correlation • Comparisons on the same array have highest precision and all other comparisons have equal precision ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
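To make the contrast with ratio-based analysis concrete, the sketch below treats every channel reading as a separate measurement and compares group means of log2 intensities, regardless of which samples shared an array. It is a toy illustration only: the readings are made-up numbers, the function name is hypothetical, and it assumes the intensities are already background-corrected and normalized and that the platform meets the independence requirements listed above.

```python
from math import log2
from statistics import mean


def intensity_based_log_ratio(readings, group_a, group_b):
    """Estimate log2(A/B) from single-channel intensities.

    readings: list of (group, intensity) tuples, one per channel reading.
    Returns mean log2 intensity of group_a minus that of group_b."""
    by_group = {}
    for group, intensity in readings:
        by_group.setdefault(group, []).append(log2(intensity))
    return mean(by_group[group_a]) - mean(by_group[group_b])


# Made-up readings for one gene; A and B need not be on the same array.
readings = [("A", 256.0), ("B", 64.0), ("A", 1024.0), ("B", 256.0)]
print(intensity_based_log_ratio(readings, "A", "B"))
```

Because no within-array ratio is required, any two groups in a randomized design can be compared in this way.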
Example of randomized design • Mouse models for muscular dystrophy Turk et al. FASEB J 20, 127-129 (2006)
Our design • Randomly assign samples to the arrays, avoiding co-hybridization of samples from the same group • 2 biological replicates • 4 technical replicates (dye-swap + replicate spotting) Turk et al. FASEB J 20, 127-129 (2006)
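The random assignment with the same-group constraint can be sketched with simple rejection sampling: shuffle, pair up, and retry if any pair comes from one group. This is an assumed implementation for illustration, not the procedure used in the cited study; the group and sample names are hypothetical and an even number of samples is assumed.

```python
import random


def random_pairs(samples, group_of, seed=None, max_tries=10_000):
    """Randomly pair samples for two-color arrays so that no array
    carries two samples from the same group.

    Returns a list of (first_dye_sample, second_dye_sample) tuples; the
    position within a pair is random too, so dye assignment is randomized."""
    rng = random.Random(seed)
    order = list(samples)
    for _ in range(max_tries):
        rng.shuffle(order)
        pairs = list(zip(order[0::2], order[1::2]))
        if all(group_of[a] != group_of[b] for a, b in pairs):
            return pairs
    raise RuntimeError("no valid assignment found; check group sizes")


# Hypothetical experiment: 4 groups with 2 biological replicates each.
group_of = {f"g{g}_{r}": f"g{g}" for g in range(1, 5) for r in (1, 2)}
for pair in random_pairs(list(group_of), group_of, seed=1):
    print(pair)
```

Dye swaps and replicate spots would be added on top of this assignment as technical replicates.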
Intensity-based analysis can go wrong Vinciotti et al. Bioinformatics 21:492-501 (2005)
Some guidelines • First determine the main question, pointing out the effect of interest • log[A/B] • Then choose analysis model, so that effect variance can be computed • VAR { log[A/B]} • Practical constraints: amount of RNA available, number of hybridizations, number of slides • A good design measures the effect of interest as accurately as possible • small VAR { log[A/B] }
Some useful links • http://dial.liacs.nl/Courses/CMSB%20Courses.html • http://www.brc.dcs.gla.ac.uk/~rb106x/microarray_tips.htm • http://exgen.ma.umist.ac.uk/course/notes/WitDesignLecture.pdf • http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
Acknowledgements • Human and Clinical Genetics, LUMC: Judith Boer, Renée de Menezes, Rolf Turk, Ellen Sterrenburg, Johan den Dunnen, Gertjan van Ommen • Microarray facility: Leiden Genome Technology Center
Case study • Two genetically-modified zebrafish strains and one wild-type • Defects mainly in muscle development • Apparent at 12-48 hours of development; early death • Question: which biological pathways are affected and responsible for defective myogenesis?