
Microarray Pre-Processing

Mark Reimers, CSHL Data 2012



  1. Microarray Pre-Processing Mark Reimers CSHL Data 2012

  2. Outline • Microarray technologies • Quality assessment • Background • Normalization • Other normalization issues • Summarization of Affymetrix

  3. Microarray Technologies

  4. Outline • Library preparation • Hybridization • cDNA expression arrays • Oligo expression arrays • Agilent • Affymetrix • Illumina • NimbleGen • Other array types

  5. Hybridization • Microarrays measure the abundance of DNA or RNA by relative hybridization

  6. Hybridization

  7. Preparing a cDNA/RNA Library from mRNA • Reverse transcribe cDNA from RNA • Fragment • Amplify cDNA • OR • Use cDNA to transcribe RNA

  8. Glass Slide Microarrays (1994)

  9. Printing Glass Slide Arrays

  10. Synthetic Oligonucleotide Arrays • Up to 25 bases

  11. Affymetrix Probes Schematic • Actually, probes are (pseudo-)randomized

  12. Affymetrix Probe Sets • Probes for older expression arrays are drawn from the 3’ end of the gene • Poly-T priming picks up poly-A tails of transcripts • Newer exon and whole-gene arrays have probes evenly distributed • Random priming gives more even coverage, but not uniform!

  13. Printed Oligonucleotide Arrays • Agilent (off-shoot of HP) uses printing technology

  14. Agilent Arrays • Now second largest supplier of arrays • Reputation for high quality and attention to detail (e.g. scanner optics) • Typical 60 nucleotide probes (60-mers) • 44K, 185K, and 244K standard sizes • Can do several (up to 8) arrays per slide

  15. NimbleGen Oligonucleotide Arrays • NimbleGen uses a micro-mirror method to de-protect during in situ oligo synthesis

  16. (Roche-) NimbleGen Arrays • Usually 60-mers • Random sequence controls provided • Standard sizes from 385K up to 2.1 million probes • Can also be multiplexed • Patent issues kept the production facility in Iceland

  17. Illumina Bead Arrays • 3 µm beads manufactured with identifying segment (~12 nt) and 50-mer probe for target • Beads in wells (for some assays with optical fiber) • First scan reads ID tag; second reads target

  18. Illumina Probes • Typically about 30 beads per probe type on each array • SD very high • No controls on most arrays • Can be multiplexed

  19. Microarray Quality Assessment

  20. Quality Assessment • You are going to be doing a lot of intense analysis on expensive data • Are there any factors that would lead you to doubt or distrust a particular datum (array)? • Quality of library: e.g. RNA quality • Quality of hybridization process • Statistical QA: try to detect non-random technical variation on any chip

  21. RNA Quality • Ideal: two sharp peaks for 18S & 28S RNA • Figure: Agilent BioAnalyzer trace

  22. Statistical Approaches • Aim: are any samples different from others in technical preparation? • Exploratory Data Analysis (EDA) • Box plots, density plots, clustering, PCA • Are there any outliers? • These could be biologically interesting • Are there associations with technical factors? • Technician; date of sample prep; etc.

  23. EDA - Boxplots • Boxplot of 16 chips from Cheung et al Nature 2005

  24. Another Portrait - Densities

  25. Each Pair Replicates One Sample • Boxplot of 16 chips from Cheung et al Nature 2005

  26. Some Causes of Technical Variation • Amount of RNA in the sample always differs • Yield of conversion to cDNA or cRNA may differ • Label incorporation may differ • Temperature of hybridization may differ • RNA may be slightly degraded in some samples • Strength of ionic buffers differs • Stringency of wash differs • Scratches may occur on some chips • Ozone may bleach Cy5 at some times

  27. Borrow an Idea from Model Testing • Question: Is the model adequate? Or do hidden factors cause systematic errors? • Examine residuals after fitting model • Should be IID Normal • Is there structure in residuals? • Plot against known technical covariates, such as order of sample • How to adapt residual examination for high-throughput assays?

  28. Statistical QA for Arrays • Model for signal of probe i on chip j: $y_{ij} \approx \mu_i + \epsilon_{ij}$ • Each gene has same mean in all arrays (mostly true) • Look at residuals after fitting model • New twist for high-throughput assays: • Examine residuals within each chip (fix j; vary i) • Plot against known technical factors of probes • Is there any factor that seems to be predicting systematic errors?
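
The residual check on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: it assumes a probes x chips matrix of log2 intensities, estimates each probe mean by the median across chips (one robust choice among several), and uses a simulated GC fraction as the "known technical factor". The data, the function names `chip_residuals` and `residual_trend`, and the injected artifact on chip 3 are all hypothetical.

```python
import numpy as np

def chip_residuals(log_signal):
    """Residuals from the model y_ij = mu_i + e_ij.

    log_signal is a (probes x chips) array of log2 intensities;
    mu_i is estimated by the median of probe i across chips.
    """
    mu = np.median(log_signal, axis=1, keepdims=True)
    return log_signal - mu

def residual_trend(residuals, covariate):
    """Pearson correlation between residuals and a probe covariate, per chip."""
    cov = covariate - covariate.mean()
    res = residuals - residuals.mean(axis=0, keepdims=True)
    num = cov @ res
    den = np.sqrt((cov ** 2).sum() * (res ** 2).sum(axis=0))
    return num / den

# Simulated data (assumption): 2000 probes, 6 chips, one chip with a GC artifact.
rng = np.random.default_rng(0)
n_probes, n_chips = 2000, 6
mu_true = rng.normal(8.0, 2.0, size=n_probes)
gc = rng.uniform(0.3, 0.7, size=n_probes)       # hypothetical probe GC fraction
y = mu_true[:, None] + rng.normal(0, 0.2, size=(n_probes, n_chips))
y[:, 3] += 2.0 * (gc - 0.5)                     # inject GC-dependent bias on chip 3

r = chip_residuals(y)
trend = residual_trend(r, gc)
print(np.round(trend, 2))                       # one correlation per chip
```

Fixing j and varying i corresponds to working down one column of `r`; a chip whose residuals track a probe covariate is a candidate for a technical artifact.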

  29. Statistical QA of Arrays • Significant artifacts may not be obvious from visual inspection or bulk statistics • General approach: plot deviations from average or residuals from fit against any technical variable: • GC content or Tm (thermodynamics) • Probe position relative to 3’ end of gene (for poly-T primed RNA) • Physical location on chip (fluid artifacts) • Average intensity across chips (saturation)

  30. Saturation • Saturation: decreasing rate of binding of RNA as more RNA occupies the probe • Quenching: light emitted by one dye molecule may be re-absorbed by a nearby dye molecule, then lost as heat • Effect proportional to square of density • Ratio vs intensity plots reveal saturation and quenching • Figure: plot of log ratio against average log intensity across chips, for GSM25377 from the CEPH expression data (GSE2552)
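
A ratio-intensity summary of the kind described on this slide can be computed as follows. This is a sketch under stated assumptions: the reference is taken to be the mean log intensity across chips, the saturation curve is a simple soft cap chosen for illustration, and the data are simulated. The names `ratio_intensity`, `binned_median`, and `cap` are hypothetical.

```python
import numpy as np

def ratio_intensity(log_signal, chip):
    """Return (A, M) for one chip: average log intensity across chips,
    and the log ratio of that chip to the across-chip average."""
    a = log_signal.mean(axis=1)
    m = log_signal[:, chip] - a
    return a, m

def binned_median(a, m, n_bins=10):
    """Median log ratio within equal-count bins of average intensity."""
    order = np.argsort(a)
    groups = np.array_split(order, n_bins)
    centers = np.array([np.median(a[g]) for g in groups])
    medians = np.array([np.median(m[g]) for g in groups])
    return centers, medians

# Simulated data (assumption): chip 0 saturates near log2 intensity 14.
rng = np.random.default_rng(3)
n_probes, n_chips = 5000, 4
true_log = rng.uniform(6.0, 15.0, size=n_probes)
y = true_log[:, None] + rng.normal(0, 0.1, size=(n_probes, n_chips))
cap = 14.0
y[:, 0] = y[:, 0] - np.log2(1.0 + 2.0 ** (y[:, 0] - cap))   # soft saturation

a, m = ratio_intensity(y, chip=0)
centers, medians = binned_median(a, m)
for c, med in zip(centers, medians):
    print(f"A = {c:5.2f}   median M = {med:6.3f}")
```

On a chip without saturation the binned medians stay near zero across the intensity range; a downward bend at the bright end is the signature the slide refers to.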

  31. How Much Variability on R-I? • Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)

  32. RNA Quality Plots in Bioconductor • AffyRNAdeg plots in affy package • Effects do not appear large because they are averaged • Samples with RNA quality differences stick out • Figure: plot of average intensity for each probe position across all genes against probe position

  33. Local Bias on Affymetrix Chips • Image of raw data on a log2 scale shows striations but no obvious artifacts • Image of ratios of probes to standard shows a smudge • Non-coding probes • Images show high values as red, low values as yellow

  34. Spatial Artifacts on Affy Chips • Bubbles (yellow) in hybridization chamber • Touching cover slip and wiping incompletely • Scratches on cover slip

  35. Model-Based QC for Affy in BioC • Robust Multi-chip Average (RMA) • fits a linear model to each probe set • High residuals show regional patterns • Figure: high residuals in green • See http://plmimagegallery.bmbolstad.com/ • Available in affyQCReport package at www.bioconductor.org

  36. Affy QC Metrics in Bioconductor • affyPLM package fits probe level model to Affymetrix raw data • NUSE - Normalized Unscaled Standard Errors • normalized relative to each gene • How many big errors?
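
The normalization step behind NUSE can be illustrated with a short Python sketch. It assumes a genes x arrays matrix of standard errors has already been produced by a probe-level model fit (here the matrix is simulated), and it divides each gene's standard errors by that gene's median across arrays. The flagging threshold of 1.1 and the simulated poor-quality array are assumptions for illustration, not values from the slides or from affyPLM.

```python
import numpy as np

def nuse(se):
    """Normalize standard errors so each gene has median 1 across arrays.

    se is a (genes x arrays) array of standard errors from a
    probe-level model fit.
    """
    return se / np.median(se, axis=1, keepdims=True)

# Simulated standard errors (assumption): array 5 is noisier than the rest.
rng = np.random.default_rng(1)
n_genes, n_arrays = 1000, 8
gene_scale = rng.uniform(0.05, 0.5, size=(n_genes, 1))
se = gene_scale * rng.lognormal(0.0, 0.1, size=(n_genes, n_arrays))
se[:, 5] *= 1.3

scores = nuse(se)
array_median = np.median(scores, axis=0)
flagged = np.flatnonzero(array_median > 1.1)    # hypothetical cutoff
print(np.round(array_median, 3), flagged)
```

Because the normalization is done gene by gene, an array stands out only when its errors are large relative to the other arrays for the same genes, which is what makes the per-array distributions comparable.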

  37. Spatial Artifacts in Agilent • Usually artifacts are not as strong as on other array types • BUT – consequential because only one probe per gene • More diffuse artifacts are common • probably reflecting wash irregularities

  38. Bioconductor arrayQuality Package

  39. Background Estimation Mark Reimers

  40. General Issues in Estimating and Compensating Background • ‘Background’ is heterogeneous – different genomic regions or probes have very different background levels • Most are comparable and a few are high

  41. Microarray Background • Non-specific hybridization • Cross-hybridization to specific non-targets • Distribution of background has outliers • High-GC probes are more variable than low-GC probes

  42. Current Model for Background Estimation • 25-mers are prone to cross-hybridization • MM > PM for about 1/3 of all probes • Cross-hybridization varies with GC content • Bases at ends matter less than central • Signal intensity varies with cross-hybridization • Simple approach is linear model: $\mu_{j,k}$ are mean effects of base j at position k
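
One way to read the linear model on this slide is as a regression of log background on indicators for "base j at position k". The sketch below fits such a model by least squares on simulated 25-mers. Everything in it is an assumption made for illustration: the sequences, the effect sizes, the noise level, and the names `one_hot_design` and `mu_true`. It is not the gcRMA implementation, and the fit quality on simulated data says nothing about performance on real arrays.

```python
import numpy as np

BASES = "ACGT"

def one_hot_design(probes):
    """Design matrix with one indicator column per (position k, base j)."""
    n, length = len(probes), len(probes[0])
    x = np.zeros((n, length * 4))
    for row, seq in enumerate(probes):
        for k, base in enumerate(seq):
            x[row, 4 * k + BASES.index(base)] = 1.0
    return x

# Simulated probes and effects (assumption): central bases matter more than ends.
rng = np.random.default_rng(2)
length, n_probes = 25, 3000
probes = ["".join(rng.choice(list(BASES), size=length)) for _ in range(n_probes)]
position_weight = 1.0 - np.abs(np.linspace(-1.0, 1.0, length))  # 0 at ends, 1 at center
base_effect = np.array([-0.3, 0.4, 0.3, -0.4])                  # A, C, G, T
mu_true = np.outer(position_weight, base_effect).ravel()        # index 4*k + j

x = one_hot_design(probes)
log_bg = 6.0 + x @ mu_true + rng.normal(0, 0.3, size=n_probes)

# The indicators at each position sum to 1, so the design is rank deficient;
# lstsq returns the minimum-norm solution, and fitted values are still unique.
coef, *_ = np.linalg.lstsq(x, log_bg, rcond=None)
fitted = x @ coef
r2 = 1.0 - np.sum((log_bg - fitted) ** 2) / np.sum((log_bg - log_bg.mean()) ** 2)
center_contrast = coef[4 * 12 + 1] - coef[4 * 12 + 3]           # C vs T at the middle position
print(round(r2, 3), round(center_contrast, 3))
```

Only contrasts between bases at the same position are identifiable in this parameterization, which is why the example reports a within-position difference rather than individual coefficients.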

  43. The gcRMA Approach to Background Correction • Estimate non-specific binding using either: • True null assay (non-homologous RNA) • Estimates from MM • Rather than fit 25 independent coefficients, fit a spline with 5 df for each base • Process background first; then normalize and fit model • Figure: typical coefficients fit for each base at each position in the gcRMA background model (using 5-df splines to model each base curve)

  44. Evaluating the gcRMA Model • We compared RNA-Seq data to microarray data on the same samples to identify genes that were not expressed; therefore all signal is cross-hybridization for these probes • We fit the gcRMA model to those probes • The model explained less than 10% of the variance among probes

  45. Evaluating gcRMA • gcRMA won on AffyComp data sets (2006) using replicates with 14 spike-ins done by Affy • Many investigators get bad results (and don’t write it up) • Gharaibeh et al., BMC Bioinformatics 2008, 9:452 claimed that gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes • Highly expressed genes are precisely where background correction doesn’t matter

  46. Why Does gcRMA Fail? • gcRMA estimates cross-hybridization by fitting regression to MM probes • MM probes contain a good deal of specific signal • Symptom: gcRMA curves are almost identical for different chips, but cross-hybridization varies considerably between chips when assessed by other means (e.g. comparing controls or fitting the gcRMA model to genes known to be absent)

  47. Does Cross-Hybridization Matter for Long Oligos? • Variation in GC content is more constrained • Cross-hybridization seems much more uniform • Too hard to estimate individual effects of bases • Model using quadratic curve to estimate distributions of bases over length is effective at reducing error • Three terms: constant, linear, quadratic

  48. Background Varies Across Long-Oligo Arrays

  49. Microarray Normalization: Widely-Used Methods

  50. Common Normalization Methods • Simple parametric methods • Align mean or median intensities • Match mean/median and SD/MAD • Nonparametric methods • Lowess for two-color arrays • Align an ‘Invariant Set’ across arrays • ‘Shoehorn’ all samples to a common distribution
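
Two of the methods listed here are simple enough to sketch directly: aligning medians, and forcing every array onto a common distribution (usually called quantile normalization). The Python below is a minimal illustration on simulated log2 intensities. The function names are hypothetical, the common distribution is taken to be the mean of the sorted columns, and ties are not given any special treatment.

```python
import numpy as np

def median_align(log_signal):
    """Shift each array (column) so all arrays share the same median."""
    med = np.median(log_signal, axis=0, keepdims=True)
    return log_signal - med + med.mean()

def quantile_normalize(log_signal):
    """Give every array the same distribution: the mean of the sorted columns.

    Each value is replaced by the reference value at its within-array rank.
    """
    order = np.argsort(log_signal, axis=0)
    ranks = np.argsort(order, axis=0)               # rank of each value in its column
    reference = np.sort(log_signal, axis=0).mean(axis=1)
    return reference[ranks]

# Simulated arrays (assumption) with different offsets and spreads.
rng = np.random.default_rng(4)
n_probes, n_arrays = 1000, 5
offsets = np.array([0.0, 0.5, -0.3, 1.0, 0.2])
scales = np.array([1.0, 1.2, 0.8, 1.1, 0.9])
y = 8.0 + offsets + scales * rng.normal(0, 1.5, size=(n_probes, n_arrays))

aligned = median_align(y)
qn = quantile_normalize(y)
print(np.round(np.median(aligned, axis=0), 3))
print(np.round(np.std(qn, axis=0), 3))
```

Median alignment removes only a location difference, so the arrays keep their different spreads; the quantile approach removes every distributional difference, including any that are biological.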
