M. Kathleen Kerr, “Design Considerations for Efficient and Effective Microarray Studies,” Biometrics 59, 822-828; December 2003 • Biostatistics Article • Oncology Journal Club, May 28, 2004
A couple of introductory points • Different kinds of microarrays • Two main distinctions • One-color (e.g. Affymetrix, long oligo) • Two-color (e.g. spotted cDNA) • Some of the statistical tools are the same and some are different • Using two-color arrays is slightly more complicated in terms of design
Statistics and Microarrays • Statistical Principles certainly apply to microarray analyses • We should be considering some of the same basic tenets when performing microarray studies • Randomization • Sample size/Replication issues • Experimental design • Good design is critical to making efficient and valid inferences.
Randomization • Might not sound applicable • But… • If you are giving a ‘treatment’, samples should be randomly assigned to treatment groups • Randomize the order in which samples are processed • Randomize the order in which hybridizations are performed • Randomize the order in which arrays are chosen from the array batch • Example: Dosing study • Looking for genetic changes in cells as a function of dose • Perform all dose=0 experiments first, then dose=1, then dose=2, etc. • But, as you proceed, you learn more and get better at processing samples, performing hybridizations, using the scanner… • Your results may be associated with dose even if dose has no effect on genetic changes: CONFOUNDING!
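A minimal sketch of how the run order in the dosing example could be randomized. The function name, the dose levels, and the number of replicates per dose are illustrative assumptions, not part of the original slides.

```python
import random


def randomized_run_order(doses, replicates_per_dose, seed=None):
    """Return a shuffled list of (dose, replicate) runs.

    Shuffling the complete run list keeps processing order from lining up
    with dose, so a technician's learning curve is not confounded with dose.
    """
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    runs = [
        (dose, rep)
        for dose in doses
        for rep in range(1, replicates_per_dose + 1)
    ]
    rng.shuffle(runs)
    return runs


if __name__ == "__main__":
    # Hypothetical study: three dose levels, three samples per dose.
    plan = randomized_run_order([0, 1, 2], 3, seed=2004)
    for position, (dose, rep) in enumerate(plan, start=1):
        print(f"run {position}: dose={dose}, replicate={rep}")
```

The same shuffled plan could be reused for sample processing, hybridization order, and the order in which arrays are drawn from the batch, or a separate shuffle could be drawn for each step.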
Sample Size and Replication • Three types of ‘replication’ in microarrays A. Spotting genes multiple times on same array B. Hybridizing multiple arrays to the same RNA samples C. Using multiple individuals of a certain type • A and B are considered ‘technical’ replicates • C describes ‘random sampling’ from the population • THESE ARE CRITICALLY DIFFERENT!
Sample Size and Replication • Technical replication: • DOES NOT address biological variability • DOES address measurement error of the assay • Usually, we are interested in how a condition affects individuals in general • We are NOT usually interested in how a condition affects any given individual • Example: AML • Do we want to make inferences about differences in gene expression across AML subpopulations? • Or, do we want to make inferences about differences in gene expression in two particular AML patients, each of whom has a different type of AML?
Sample Size and Replication • Why/When would we be interested in technical replication? • Medical diagnosis • Need to know how precise the measures are • Sensitivity and specificity of the assay depend on that
Sample Size and Replication • Biological replicates • Tell us about the variability across samples of the same type • Biological variability is critical for • Finding differences in gene expression across populations • Classification procedures that try to use gene expression patterns to differentiate individuals of different types • If you use just one sample or cell line to make inferences about the population of interest • You are making a BIG assumption: “Population is relatively homogeneous” • You cannot evaluate that assumption based on the data from the study.
Sample Size and Replication • For a fixed sample size (number of arrays): • It is preferable to sample NEW individuals rather than perform technical replicates • Why? It is more efficient in terms of variance, power, etc. • You gain much less from technical replicates than from new samples • But, if it is expensive to sample new individuals • Examples: samples are very rare, recruitment is difficult, procedure for acquiring samples is risky or expensive • In this case, it might be worthwhile to perform some technical replicates based on a “cost-benefit” analysis • GENERAL RULE: TRUE REPLICATION BEATS TECHNICAL REPLICATION FOR GAINS IN PRECISION WHEN ESTIMATING PARAMETERS
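The general rule can be illustrated with the standard variance formula for a group mean when each of n individuals is measured on m technical replicates: Var = bio_var / n + tech_var / (n · m). The sketch below assumes independent biological and technical errors; the function name and the variance values are hypothetical and chosen only for illustration.

```python
def variance_of_group_mean(bio_var, tech_var, n_individuals, n_tech_reps):
    """Variance of a group's mean expression for one gene.

    Assumes independent errors: biological variance is averaged over
    individuals only, technical variance over every array.
    """
    return bio_var / n_individuals + tech_var / (n_individuals * n_tech_reps)


if __name__ == "__main__":
    # Hypothetical variances; both plans use 12 arrays.
    bio_var, tech_var = 1.0, 1.0
    all_new = variance_of_group_mean(bio_var, tech_var, 12, 1)
    replicated = variance_of_group_mean(bio_var, tech_var, 4, 3)
    print(f"12 individuals x 1 array:  {all_new:.3f}")
    print(f" 4 individuals x 3 arrays: {replicated:.3f}")
```

Under this model, technical replicates shrink only the tech_var term, while new individuals shrink both terms, which is why new samples are preferred unless they are much more expensive than arrays.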
Pooling of Samples • Often motivated by an insufficient quantity of RNA, which is reasonable • Sometimes proposed to ‘control’ for biological variability • Bad idea! • We need to understand, not eliminate, biological variability • To understand the differences in mean expression across two populations (e.g. normal karyotype and t(15;17)), we need to be able to estimate the population means and their variability • We cannot do that if we have pooled RNA • We can estimate the mean difference between two groups based on pooled samples • But we cannot make inferences about whether or not there is a difference in mean expression, because a single pool per group gives no estimate of variability.
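A small sketch of why one pool per group supports an estimated difference but not an inference. It uses a Welch t statistic as a stand-in for whatever test a study would actually use; the function name and the log-expression values are made up for illustration.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch t statistic for a difference in means, or None if untestable.

    Each list holds log-expression values for one gene, one value per
    independently hybridized sample (or per pool).
    """
    if len(group_a) < 2 or len(group_b) < 2:
        # A single pool gives a point estimate but no variance estimate.
        return None
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    if se == 0:
        # Degenerate case: no observed variability in either group.
        return None
    return diff / se


if __name__ == "__main__":
    # Hypothetical values: one pooled measurement per group.
    print(welch_t([8.1], [9.4]))  # difference is 8.1 - 9.4, but no test
    # Hypothetical values: four individuals per group.
    print(welch_t([7.9, 8.3, 8.0, 8.2], [9.1, 9.6, 9.3, 9.6]))
```

With several independent pools per group the statistic becomes computable again, which is the idea behind the ‘in between’ pooling strategies.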
Pooling of Samples • Pooling is ALWAYS bad if your goal is • Finding a classification scheme • Discovering unknown subtypes • There is an ‘in between’ strategy for pooling when we are interested in determining whether average expression differs between two phenotypes (Kendziorski et al. (2003)) • Pooling RNA for use as a ‘reference’ is OK (more in a minute).
Experimental Layout • Discussion specific to two-color arrays • Complicated due to pairing of samples on arrays • One-color array design considerations usually more straightforward • Critical determinant of design efficiency. • Three main types of designs in two-color arrays: • Reference • Loop • Dye swap
Reference Design • Each arrow represents an array • Let's say that the origin of the arrow is green and the head of the arrow is red • Each sample of interest is paired with the same “reference” sample • AML example: reference was 11 pooled cell lines • Here, each sample is labeled with red (Cy5) and the reference is labeled with green (Cy3) • Each sample is hybridized to only ONE array (each with the reference) • [Diagram labels: Type 1, Type 2, Reference sample]
Loop Design • Each sample is paired with a sample of the other type (no reference!) • Each sample is hybridized to TWO arrays and is labeled both red and green • Can compare any two samples through the arrays between them in the loop • Relative efficiency is 4 to 1 comparing loop to reference • Downside: what if just ONE array goes bad? Loop is not a loop anymore! • Good design for a small number of samples: uses information very effectively • [Diagram labels: Type 1, Type 2]
Dye Swap Design • Each sample is paired with the same sample of the other type TWICE • Each sample is hybridized to TWO arrays • Dyes are swapped • Relative efficiency is 4 to 1 comparing dye swap to reference • More robust than loop • Less complicated than loop • Direct comparisons are not as easy because samples are not linked through other samples as in the other two designs • [Diagram labels: Type 1, Type 2]
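The three layouts can be written down as lists of arrays, following the arrow convention from the reference design slide: each array is a (green, red) pair. This is only an illustrative sketch; the function names and sample labels are hypothetical.

```python
def reference_design(samples, reference="REF"):
    """One array per sample: reference in green (Cy3), sample in red (Cy5)."""
    return [(reference, sample) for sample in samples]


def loop_design(samples):
    """Chain the samples in a closed loop.

    Each sample appears on two arrays, once in green and once in red.
    Order the list so that the two types alternate.
    """
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]


def dye_swap_design(pairs):
    """Each (type1, type2) pair gets two arrays with the dyes exchanged."""
    arrays = []
    for type1_sample, type2_sample in pairs:
        arrays.append((type1_sample, type2_sample))
        arrays.append((type2_sample, type1_sample))
    return arrays


if __name__ == "__main__":
    # Hypothetical labels: A* are Type 1 samples, B* are Type 2 samples.
    print(reference_design(["A1", "A2", "B1", "B2"]))
    print(loop_design(["A1", "B1", "A2", "B2"]))
    print(dye_swap_design([("A1", "B1"), ("A2", "B2")]))
```

With four samples, all three layouts here use four arrays; what changes is which samples share an array, and therefore which comparisons are direct.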
Why is the reference design used so often? • As population variance increases, loop and dye swap designs have less advantage • Sample comparisons must go ‘through’ the loop • Direct comparisons are not easy in dye swap if samples are not on the same chip • If you have a large number of samples, loop is risky due to ‘bad chips’ • Logically, however, by using the reference on every chip, we are ‘wasting’ a resource • But complex designs have less of an efficiency advantage as the number of RNAs increases
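A rough sketch of how the 4-to-1 efficiency figure and its erosion with population variance can arise. It rests on a simplified model that is an assumption here, not taken from the slides: every array yields one log-ratio with independent technical error variance tech_var, each individual deviates from its type mean with variance bio_var, the same reference RNA is used on every array, and dye effects are ignored. Dye swap is used for the comparison because its algebra is the simplest.

```python
def var_reference(bio_var, tech_var, k):
    """Variance of the estimated Type 1 minus Type 2 difference.

    Reference design with k individuals per type, 2k arrays in total.
    Each group mean has variance (bio_var + tech_var) / k.
    """
    return 2 * (bio_var + tech_var) / k


def var_dye_swap(bio_var, tech_var, k):
    """Same quantity for a dye swap design with k pairs, 2k arrays in total.

    Averaging the two arrays of a pair halves the technical variance,
    but both individuals' biological deviations remain.
    """
    return (2 * bio_var + tech_var / 2) / k


def relative_efficiency(bio_var, tech_var, k=4):
    """Ratio of variances: reference design over dye swap design."""
    return var_reference(bio_var, tech_var, k) / var_dye_swap(bio_var, tech_var, k)


if __name__ == "__main__":
    # Hypothetical variance settings, technical variance fixed at 1.
    for bio_var in (0.0, 0.5, 1.0, 5.0, 100.0):
        ratio = relative_efficiency(bio_var, 1.0)
        print(f"bio_var={bio_var:6.1f}  reference/dye-swap variance ratio={ratio:.2f}")
```

Under these assumptions the ratio is exactly 4 when there is no biological variability and falls toward 1 as biological variability dominates, which is one way to read the first bullet above.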
Robustness • Two robust alternatives, each requiring 2x as many arrays: • “Double reference” • “Double loop”
Practical Considerations • Simplicity • Large study with many technicians • Extendability • Open-ended • Can add additional samples at a later time depending on what early results suggest • Reference and “symmetric” reference designs • Useful subdesigns • “subgroup analyses” • Example: all AMLs vs. normal karyotype