
Practical Issues in Microarray Data Analysis

Practical Issues in Microarray Data Analysis. Mark Reimers, National Cancer Institute, Bethesda, Maryland. Overview: scales for analysis; systematic errors; sample outliers and experimental consistency; useful graphics; implications for experimental design; platform consistency.


Presentation Transcript


  1. Practical Issues in Microarray Data Analysis • Mark Reimers, National Cancer Institute, Bethesda, Maryland

  2. Overview • Scales for analysis • Systematic errors • Sample outliers & experimental consistency • Useful graphics • Implications for experimental design • Platform consistency • Individual differences

  3. Distribution of Signals • Most genes are expressed at very low levels • Even after log-transform the distribution is skewed • NB: the signal-to-abundance ratio is NOT the same for different genes on the chip

  4. Explanation of Distribution Shape • The steep bell-curve shape on the left-hand side is probably due to measurement noise • The underlying real distribution is probably even steeper • [Figure: abundances + noise = observed values]

  5. Variation Between Chips • Technical variation: differences between measures of transcript abundance in the same samples • Causes: sample preparation, slide, hybridization, measurement • Individual variation: variation between samples or individuals • Healthy individuals really do have consistently different levels of gene expression!

  6. Replicates in True Scale • Signals vary more between replicates at the high end • Level of ‘noise’ increases with signal • [Figure, two panels: comparison of chips (Affy), chip 1 vs chip 2; SD as a function of mean signal across all chips, with the lowess fit shown as a red line]

  7. Replicates on Log Scale • Measures fold-change identically across genes • Noise at the lower end is higher after log transform • [Figure, two panels: chip 1 vs chip 2 after log transform; SD vs signal after log transform]
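The contrast between the two scales can be reproduced with simulated replicates. The following is a minimal sketch, assuming the additive-plus-multiplicative error model that appears later in the talk; the background level, the error SDs, the abundance range and the number of replicates are all hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_genes, n_reps = 2000, 8
background = 100.0   # hypothetical additive background
sd_mult = 0.10       # hypothetical SD of the multiplicative error
sd_add = 25.0        # hypothetical SD of the additive (static) error

# Hypothetical true abundances, spread log-uniformly between 10 and 10,000.
true_signal = 10 ** rng.uniform(1, 4, size=n_genes)

eta = rng.normal(0.0, sd_mult, size=(n_genes, n_reps))
eps = rng.normal(0.0, sd_add, size=(n_genes, n_reps))
observed = background + true_signal[:, None] * np.exp(eta) + eps
observed = np.maximum(observed, 1.0)  # guard against non-positive values before the log

# Per-gene SD across replicates, on the true scale and on the log2 scale.
raw_sd = observed.std(axis=1, ddof=1)
log_sd = np.log2(observed).std(axis=1, ddof=1)

# Compare the lowest and highest quarter of genes by true abundance.
low = true_signal < np.quantile(true_signal, 0.25)
high = true_signal > np.quantile(true_signal, 0.75)

print("raw SD,  low vs high:", raw_sd[low].mean(), raw_sd[high].mean())
print("log2 SD, low vs high:", log_sd[low].mean(), log_sd[high].mean())
```

Under these assumed parameters the raw-scale SD grows with signal, while on the log scale the low-abundance genes are the noisier ones.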

  8. Ratio-Intensity (R-I) plots • Log scale makes it convenient to represent fold-changes up or down symmetrically • R = log(Red/Green); I = (1/2)log(Red*Green) • Also known as MA (minus, add) plots • [Figure axes: (log) Ratio vs (log) Intensity]
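The two plot coordinates follow directly from the definitions on this slide. A minimal sketch, assuming base-2 logarithms (the slide does not state the base) and a hypothetical helper name `ratio_intensity`:

```python
import numpy as np

def ratio_intensity(red, green):
    """Return (R, I) for two-channel signals, using log base 2 (an assumption)."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    ratio = np.log2(red / green)            # R = log(Red/Green)
    intensity = 0.5 * np.log2(red * green)  # I = (1/2) log(Red*Green)
    return ratio, intensity

# A 4-fold increase and a 4-fold decrease are symmetric around zero.
r, i = ratio_intensity([8.0, 2.0, 4.0], [2.0, 8.0, 4.0])
print(r, i)
```

Plotting `r` against `i` for every spot gives the R-I (MA) plot.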

  9. Variance Stabilization • Simple power transforms (Box-Cox) often nearly stabilize variance • Durbin and Huber derived a variance-stabilizing transform from a theoretical model: • y = a (background) + m·e^η (multiplicative error) + ε (static error) • m is the true signal; η and ε have N(0, σ_η) and N(0, σ_ε) distributions • Transform: [equation not preserved in this transcript] • Could estimate a (background) and σ_η/σ_ε empirically • In practice the best effect on variance often comes from parameters different from the empirical estimates • Huber’s is harder to estimate
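For reference, one commonly cited variance-stabilizing transform for this error model is the generalized logarithm of Durbin, Hardin, Hawkins and Rocke. The block below is a sketch under that assumption, not a reconstruction of the slide's own equation. Here η and ε denote the multiplicative and static error terms, σ_η and σ_ε are their standard deviations, and S_η and c are derived constants that do not appear on the slide.

```latex
\begin{aligned}
y &= a + m\,e^{\eta} + \varepsilon,
  \qquad \eta \sim N(0,\sigma_\eta^{2}),\quad \varepsilon \sim N(0,\sigma_\varepsilon^{2}) \\
\operatorname{Var}(y) &= m^{2} S_\eta^{2} + \sigma_\varepsilon^{2},
  \qquad S_\eta^{2} = e^{\sigma_\eta^{2}}\left(e^{\sigma_\eta^{2}} - 1\right) \\
h(y) &= \ln\!\left( (y - a) + \sqrt{(y - a)^{2} + c} \right),
  \qquad c = \sigma_\varepsilon^{2} / S_\eta^{2}
\end{aligned}
```

For large signals h(y) behaves like ln(2(y − a)), and near the background it is approximately linear in y. Huber's arsinh-based transform has the same functional shape, since arsinh(x) = ln(x + √(x² + 1)).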

  10. Box-Cox Transforms • Simple power transformations (including log as the limiting case), e.g. cube root • Often work almost as well as a variance-stabilizing transform
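The power family can be written as a single function. A minimal sketch using the standard Box-Cox form (x^λ − 1)/λ, with the natural log as the λ = 0 case; the function name `box_cox` is only illustrative:

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox power transform of positive values; lam = 0 gives the natural log."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

signal = np.array([1.0, 8.0, 1000.0])
print(box_cox(signal, 1.0 / 3.0))  # cube-root member of the family
print(box_cox(signal, 0.0))        # log, the limiting case
```

As λ shrinks towards zero the transform approaches the log, which is why the log counts as the strongest member of the family.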

  11. Should you use Transforms? • Transforms change the list of genes that are differentially regulated • The common argument is that bright genes have higher variability • However, you aren’t comparing different genes • Log transform expands the variability of repressed genes • Strong transforms (e.g. log) are most suitable for situations where large fold-changes occur (e.g. cancers) • Weak transforms are more suited for situations where small changes are of interest (e.g. neurobiology)

  12. Graphical methods • Aims: • Exploratory analysis, to see natural groupings, and to detect outliers • To identify combinations of features that usefully characterize samples or genes • Not really suitable for quantitative measures of confidence • Principal Components Analysis (PCA) • Standard procedure of finding combinations with greatest variance • Multi-dimensional scaling (MDS) • Represent distances between samples as distances between points in two or three dimensions • Easy to visualize
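One way to turn a matrix of between-sample distances into plot coordinates is classical (Torgerson) MDS. The talk does not say which MDS variant was used, so this is a sketch under that assumption; the function name and the toy distance matrix are hypothetical.

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Embed samples in n_dims dimensions from a symmetric distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering       # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(gram)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]     # keep the largest n_dims
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy example: four samples whose distances are those of a 3-by-4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
coords = classical_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(np.round(recovered, 3))
```

The recovered coordinates match the original layout only up to rotation and reflection, so the axes of an MDS plot carry no meaning on their own; only the distances between points do.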

  13. MDS Plots

  14. Representing Groups • [Figure: Day 1 chips shown as a cluster diagram and as a multi-dimensional scaling plot]

  15. Different Metrics – Same Scale • 8 tumor; 2 normal tissue samples • Distances are similar in each tree • Normals close • Tree topologies appear different • Take with a grain of salt!

  16. Volcano Plot • Displays both biological importance and statistical significance • [Figure axes: log2(p-value) or t-score vs log2(fold change)]
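The two coordinates of a volcano plot can be computed per gene as follows. This is a minimal sketch, assuming the data are already on the log2 scale (so the log fold change is a difference of group means) and using a Welch-style t-score for the significance axis; the function name and array layout (genes in rows, samples in columns) are assumptions.

```python
import numpy as np

def volcano_coordinates(group1, group2):
    """Return (log2 fold change, t-score) per gene for log2-scale data."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    log2_fc = g2.mean(axis=1) - g1.mean(axis=1)
    # Standard error of the difference, allowing unequal group variances.
    se = np.sqrt(g1.var(axis=1, ddof=1) / g1.shape[1]
                 + g2.var(axis=1, ddof=1) / g2.shape[1])
    return log2_fc, log2_fc / se

# Two hypothetical genes, three samples per group.
fc, t = volcano_coordinates([[1.0, 2.0, 3.0], [5.0, 5.5, 4.5]],
                            [[4.0, 5.0, 6.0], [5.2, 4.8, 5.0]])
print(fc, t)
```

Genes of interest sit in the upper outer corners of the plot, where both the fold change and the t-score are large in magnitude.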

  17. Quantile Plot • Plot sample t-scores against the t-scores expected under the random (null) hypothesis • Statistically significant genes stand out • [Figure axes: sample t-scores vs corresponding quantiles of the t-distribution]

  18. Systematic Variation • Intensity-dependent dye bias due to ‘quenching’ • Stringency (specificity) of hybridization, which depends on the ionic strength of the hybridization solution • How far the hybridization reaction progresses, which depends on mixing efficiency • Spatial variation in all of the above

  19. Relevance for Experimental Designs • Balanced designs with several replicates built in have smaller standard errors than a reference design with the same number of chips (Kerr & Churchill) • Assuming error is random! • In practice it is very hard to deal with systematic errors in a symmetric design • No two slides with comparable fold-changes • [Figure: design diagram connecting Samples 1 to 5]

  20. Critique of Optimal Designs • Optimal for reduction of variance, if • All chips are good quality • No systematic errors – only random noise • In fact systematic error is almost as great as random noise in many microarray experiments • With loop designs single chip failures cause more loss of information than with reference designs
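The trade-off between loop and reference designs can be checked numerically for an idealised case. The sketch below assumes a deliberately simplified model: each chip yields one log-ratio between two samples, errors are independent with equal variance σ², and there are no dye or systematic effects. The helper names are hypothetical. It reports the variance of an estimated difference between two samples in units of σ².

```python
import numpy as np

def contrast_variance(design, contrast):
    """Variance (in units of sigma^2) of a least-squares contrast estimate."""
    info = design.T @ design
    return float(contrast @ np.linalg.pinv(info) @ contrast)

def loop_design(k):
    """Chip i compares sample i with sample i+1, closing the loop at the end."""
    x = np.zeros((k, k))
    for chip in range(k):
        x[chip, chip] = 1.0
        x[chip, (chip + 1) % k] = -1.0
    return x

def reference_design(k):
    """Chip i compares sample i with a common reference (last column)."""
    x = np.zeros((k, k + 1))
    for chip in range(k):
        x[chip, chip] = 1.0
        x[chip, k] = -1.0
    return x

k = 5
loop, ref = loop_design(k), reference_design(k)
adjacent = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # sample 0 minus sample 1
two_apart = np.array([1.0, 0.0, -1.0, 0.0, 0.0])  # sample 0 minus sample 2
ref_pair = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])

var_loop_adjacent = contrast_variance(loop, adjacent)
var_loop_two_apart = contrast_variance(loop, two_apart)
var_reference = contrast_variance(ref, ref_pair)
# Drop the chip linking samples 0 and 1 to mimic a single chip failure.
var_loop_failed = contrast_variance(loop[1:], adjacent)

print(var_loop_adjacent, var_loop_two_apart, var_reference, var_loop_failed)
```

With five chips and purely random error, the loop gives variances of 0.8 and 1.2 against 2.0 for the reference design. After one chip failure, however, the comparison of the two samples that shared the failed chip rises to 4.0, whereas in a reference design the comparisons among the remaining samples are unaffected.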

  21. Individual Variation • Numerous genes show high levels of inter-individual variation • The level of variation also depends on the tissue • Donors or experimental animals may be infected or under social stress • Tissues are hypoxic or ischemic for variable times before freezing

  22. Frequent False Positives • Immunoglobulins and stress response proteins are often 5-10X higher than typical in one or two samples • Permutation p-values will be insignificant, even if the t-score appears large • [Figure: frequency distribution of gene levels in Group 1 and Group 2]
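A permutation p-value for a single gene can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the talk: the statistic is a Welch-style t-score, labels are shuffled at random rather than enumerated exhaustively, and the expression values are made up.

```python
import numpy as np

def t_score(a, b):
    """Welch-style t-score for the difference in means of two groups."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided p-value from random relabelling of the pooled samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(t_score(a, b))
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # Small tolerance so rounding does not hide exact ties.
        if abs(t_score(perm[:a.size], perm[a.size:])) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical gene driven by one aberrant sample in the first group.
p_outlier = permutation_p_value([1.0, 1.1, 0.9, 1.2, 8.0],
                                [1.0, 0.9, 1.1, 1.0, 1.05])
# Hypothetical gene with a consistent shift in every sample of the first group.
p_shift = permutation_p_value([5.0, 5.1, 4.9, 5.2, 5.05],
                              [1.0, 0.9, 1.1, 1.0, 1.05])
print(p_outlier, p_shift)
```

Because the aberrant value produces a similar statistic whichever group it is shuffled into, the single-outlier gene does not reach significance, while the consistent shift does.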
