Practical Issues in Microarray Data Analysis. Mark Reimers, National Cancer Institute, Bethesda, Maryland. Overview. Scales for analysis • Systematic errors • Sample outliers & experimental consistency • Useful graphics • Implications for experimental design • Platform consistency
Practical Issues in Microarray Data Analysis Mark Reimers National Cancer Institute Bethesda Maryland
Overview • Scales for analysis • Systematic errors • Sample outliers & experimental consistency • Useful graphics • Implications for experimental design • Platform consistency • Individual differences
Distribution of Signals • Most genes are expressed at very low levels • Even after log-transform the distribution is skewed • NB: Signal to abundance ratio NOT the same for different genes on the chip
Explanation of Distribution Shape • Left hand steep bell curve probably due to measurement noise • Underlying real distribution probably even steeper • [Figure: abundances + noise = observed values]
Variation Between Chips • Technical variation: differences between measures of transcript abundance in same samples • Causes: • Sample preparation • Slide • Hybridization • Measurement • Individual variation: variation between samples or individuals • Healthy individuals really do have consistently different levels of gene expression!
Replicates in True Scale • Signals vary more between replicates at high end • Level of ‘noise’ increases with signal • [Figures: comparison of chips (Affy), chip 1 vs chip 2; SD as a function of mean signal across all chips, red line is lowess fit]
Replicates on Log Scale • Measures fold-change identically across genes • Noise at lower end is higher in log transform • [Figures: chip 1 vs chip 2 after log transform; SD vs signal after log transform]
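The contrast between the raw and log scales can be checked with a small numeric sketch. The replicate values below are invented for illustration and assume purely multiplicative (proportional) noise; they deliberately ignore the additive background noise that inflates log-scale variability at the low end.

```python
import math
from statistics import stdev

# Hypothetical replicate signals for a dim gene and a bright gene,
# each varying by +/-10% around its centre (assumed, not from the slides).
low_gene = [90.0, 100.0, 110.0]
high_gene = [9000.0, 10000.0, 11000.0]

# Standard deviation on the raw (true) scale.
sd_raw_low = stdev(low_gene)
sd_raw_high = stdev(high_gene)

# Standard deviation after a log2 transform.
sd_log_low = stdev([math.log2(v) for v in low_gene])
sd_log_high = stdev([math.log2(v) for v in high_gene])

print("raw scale SD:", sd_raw_low, sd_raw_high)
print("log scale SD:", sd_log_low, sd_log_high)
```

Under this assumption the raw-scale SD grows in proportion to the signal, while the log-scale SD is the same for both genes, which is why the log scale measures fold-change identically across genes.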
Ratio-Intensity (R-I) plots • Log scale makes it convenient to represent fold-changes up or down symmetrically • R = log(Red/Green); I = (1/2)log(Red*Green) • a.k.a. MA (minus, add) plots • [Plot axes: (log) Ratio against (log) Intensity]
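A minimal sketch of the R and I coordinates for a single spot follows. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name and the signal values are hypothetical.

```python
import math

def ratio_intensity(red, green):
    """Return (R, I) for one spot, assuming base-2 logs.

    R = log2(red / green)        -> the "minus" (M) axis
    I = 0.5 * log2(red * green)  -> the "add" (A) axis
    """
    if red <= 0 or green <= 0:
        raise ValueError("signals must be positive to take logs")
    r = math.log2(red / green)
    i = 0.5 * math.log2(red * green)
    return r, i

# A four-fold increase and a four-fold decrease (illustrative values).
up = ratio_intensity(800.0, 200.0)
down = ratio_intensity(200.0, 800.0)
print(up, down)
```

The two spots have R values of equal size and opposite sign at the same intensity I, which is the symmetry the log scale provides.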
Variance Stabilization • Simple power transforms (Box-Cox) often nearly stabilize variance • Durbin and Huber derived a variance-stabilizing transform from a theoretical model: • $y = \alpha + \mu e^{\eta} + \varepsilon$, where $\alpha$ is the background, $e^{\eta}$ the multiplicative error and $\varepsilon$ the static error • $\mu$ is true signal; $\eta$ and $\varepsilon$ have $N(0, \sigma)$ distributions • Transform: [equation missing from source] • Could estimate $\alpha$ (background) and $\sigma_{\eta}/\sigma_{\varepsilon}$ empirically • In practice often best effect on variance comes from parameters different from empirical estimates • Huber’s version is harder to estimate
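The transform itself is not recoverable from this transcript. As a hedged illustration only, the sketch below uses the generalized-log form commonly associated with this kind of two-component error model, g(y) = ln((y - alpha) + sqrt((y - alpha)^2 + c)). Here alpha is the background offset and c is a tuning constant tied to the ratio of additive to multiplicative error; the function name and the numeric values are assumptions, not taken from the slides.

```python
import math

def glog(y, alpha, c):
    """Generalized-log transform (assumed form, see lead-in).

    alpha: background offset; c: positive constant balancing
    additive against multiplicative error.
    """
    z = y - alpha
    return math.log(z + math.sqrt(z * z + c))

alpha, c = 100.0, 10000.0  # illustrative parameter values

# Unlike a plain log, the transform stays finite at and below background.
at_background = glog(100.0, alpha, c)
below_background = glog(50.0, alpha, c)

# Far above background it behaves like an ordinary log of the signal.
bright = glog(1_000_100.0, alpha, c)
print(at_background, below_background, bright)
```

This form is equivalent to an inverse hyperbolic sine of the background-corrected signal plus a constant, so it is roughly linear near background and logarithmic for bright spots.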
Box-Cox Transforms • Simple power transformations (including log as the limiting case), e.g. cube root • Often work almost as well as variance-stabilizing transform
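A minimal sketch of the Box-Cox family in its standard form, (x^lambda - 1) / lambda, with the log as the lambda = 0 limit. The function name is hypothetical and the inputs are illustrative.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a positive value x.

    lam = 0 gives the natural log; lam = 1/3 is a (shifted, scaled)
    cube root; lam = 1 leaves the scale essentially unchanged.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

cube_root_like = box_cox(8.0, 1.0 / 3.0)
near_log = box_cox(5.0, 1e-6)   # very small lambda approaches log
exact_log = box_cox(5.0, 0.0)
print(cube_root_like, near_log, exact_log)
```

Smaller values of lambda compress bright signals more strongly, so lambda acts as a dial between no transform and the log.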
Should you use Transforms? • Transforms change the list of genes that are differentially regulated • The common argument is that bright genes have higher variability • However you aren’t comparing different genes • Log transform expands the variability of repressed genes • Strong transforms (e.g. log) most suitable for situations where large fold-changes occur (e.g. cancers) • Weak transforms more suited for situations where small changes are of interest (e.g. neurobiology)
Graphical methods • Aims: • Exploratory analysis, to see natural groupings, and to detect outliers • To identify combinations of features that usefully characterize samples or genes • Not really suitable for quantitative measures of confidence • Principal Components Analysis (PCA) • Standard procedure of finding combinations with greatest variance • Multi-dimensional scaling (MDS) • Represent distances between samples as distances between points in a two- or three-dimensional map • Easy to visualize
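To make "combinations with greatest variance" concrete, here is a small sketch that finds the leading principal component by power iteration on mean-centred data. Power iteration is a stand-in chosen to keep the example dependency-free (library routines typically use a singular value decomposition); the function name and the toy expression values are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component.

    rows: list of samples, each a list of gene values.
    Assumes the starting vector is not orthogonal to the answer.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        # Multiply v by X^T X without forming the matrix explicitly.
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Four samples, three genes: samples 0-1 and 2-3 form two groups.
samples = [
    [1.0, 2.0, 5.0],
    [1.2, 2.1, 5.1],
    [3.0, 4.0, 5.0],
    [3.1, 4.2, 4.9],
]
direction, scores = first_principal_component(samples)
print(direction, scores)
```

Plotting the scores (here one number per sample) is the exploratory step: the two groups fall on opposite sides of zero, but the picture carries no measure of confidence.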
Representing Groups • [Figures: Day 1 chips shown as a cluster diagram and as a multi-dimensional scaling plot]
Different Metrics – Same Scale • 8 tumor; 2 normal tissue samples • Distances are similar in each tree • Normals close • Tree topologies appear different • Take with a grain of salt!
Volcano Plot • Displays both biological importance and statistical significance • [Plot axes: log2(p-value) or t-score against log2(fold change)]
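The two coordinates of a volcano plot can be sketched for a single gene as follows. The example assumes the inputs are already log2 expression values, so the difference of group means is a log2 fold change, and it uses a Welch (unequal variance) t-score as one reasonable choice; the function name and data are hypothetical.

```python
import math
from statistics import mean, variance

def volcano_coordinates(group_a, group_b):
    """Return (log2 fold change, Welch t-score) for one gene.

    group_a, group_b: log2 expression values, at least two per group,
    with non-zero variance in at least one group.
    """
    diff = mean(group_b) - mean(group_a)
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return diff, diff / se

# Illustrative log2 values for one gene in two groups of four chips.
control = [8.0, 8.2, 7.9, 8.1]
treated = [10.0, 10.3, 9.9, 10.2]
log2_fc, t = volcano_coordinates(control, treated)
print(log2_fc, t)
```

Genes of interest are those far from zero on both axes: a large fold change with a small t-score is unreliable, and a large t-score with a tiny fold change may not matter biologically.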
Quantile Plot • Plot sample t-scores against t-scores under random hypothesis • Statistically significant genes stand out • [Plot axes: sample t-scores against corresponding quantiles of t-distribution]
Systematic Variation • Intensity-dependent dye bias due to ‘quenching’ • Stringency (specificity) of hybridization due to ionic strength of hyb solution • How far hybridization reaction progresses due to variation in mixing efficiency • Spatial variation in all of the above
Relevance for Experimental Designs • Balanced designs with several replicates built in have smaller standard errors than reference design with same number of chips – Kerr & Churchill • Assuming error is random! • In practice very hard to deal with systematic errors in a symmetric design • No two slides with comparable fold-changes • [Figure: design diagram linking Samples 1 to 5]
Critique of Optimal Designs • Optimal for reduction of variance, if • All chips are good quality • No systematic errors – only random noise • In fact systematic error is almost as great as random noise in many microarray experiments • With loop designs single chip failures cause more loss of information than with reference designs
Individual Variation • Numerous genes show high levels of inter-individual variation • Level of variation depends on tissue also • Donors, or experimental animals may be infected, or under social stress • Tissues are hypoxic or ischemic for variable times before freezing
Frequent False Positives • Immunoglobulins and stress response proteins are often 5-10X higher than typical in one or two samples • Permutation p-values will be insignificant, even if t-score appears large • [Figure: frequency of gene levels in Group 1 and Group 2]
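The behaviour of a permutation p-value with a single outlying sample can be sketched as below. The sketch enumerates every way of relabelling the samples and counts how often the absolute Welch t-score is at least as large as the observed one. The data are invented (one sample in group 2 is about 8X higher than the rest) and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_score(a, b):
    """Welch t-score for two groups of values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se

def permutation_p_value(a, b):
    """Exact two-sided permutation p-value over all relabellings."""
    pooled = a + b
    observed = abs(t_score(a, b))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        perm_a = [pooled[i] for i in idx]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed labelling counts itself.
        if abs(t_score(perm_a, perm_b)) >= observed - 1e-12:
            count += 1
    return count / total

group1 = [1.0, 1.1, 0.9, 1.0, 1.05]
group2 = [1.0, 0.95, 1.1, 1.0, 8.0]   # one outlying sample
p = permutation_p_value(group1, group2)
print(p)
```

Because the outlier produces a similar t-score whichever group it is assigned to, many relabellings match the observed statistic and the permutation p-value stays well above conventional significance thresholds.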