This toolbox provides an overview of quality control techniques, statistical methods, and experimental designs for gene expression microarray analysis. Topics covered include quality control charts, false discovery rate, principal components analysis, data normalization and transformation, and analysis of Affymetrix probesets. The importance of replication and statistical methods in gene expression analysis is emphasized, along with strategies for analyzing complex study designs. The toolbox also covers the interpretation of gene lists and the calculation of false discovery rate. Principal components analysis is described as a technique for visualizing gene expression data and identifying important expression patterns. The toolbox includes examples, visualizations, and guidelines for analyzing gene expression microarray data.
MSCL Analyst’s Toolbox, Part 2 Instructors: Jennifer Barb, Zoila G. Rangel, Peter Munson March 2007 Mathematical and Statistical Computing Laboratory Division of Computational Biosciences
Statistical topics • Quality Control Charts • False Discovery Rate • Principal Components Analysis explained • PCA Heatmap • Data normalization, transformation • Affymetrix probesets and “Probe-level” analysis • MAS5, RMA, S10 compared
Gene Expression Microarrays • Started in mid-1990s, exponential growth in popularity • High-throughput -- measures 10,000s of genes at once • Very noisy -- systematic and random errors • Chip manufacturing, printing artifacts • RNA sample quality issues • Sample preparation, amplification, labeling reaction problems • Hybridization reaction variability • Linearity of response, saturation, background • Affymetrix has controlled chip quality well. • REPLICATION IS STILL REQUIRED! • Statistical methods are critical in analysis! • Quality Control is Essential!
Quality Control Plots for Parameters RawQ, ScaleFactor • New scanner installed • Scanner “burn-in”?
Experimental Designs for Gene Expression • Cross-sectional clinical studies from 2 or more patient groups or tissues; identify markers, prognostic indicators. • Animal model: samples compared between treatments, groups, or over time; identify genes involved in disease process. • Intervention Trial: collect blood samples pre/post treatment or over time, identify (and rationalize) genes involved. • Cell culture: Treat cells in culture, identify genes and patterns of response. Complex study designs possible. • Genetic Knock-out: Perturb genotype, give treatment, investigate expression response, in animal or cells.
Gene Expression Analysis Strategies • Clinical Studies: • Exploratory analysis, Hierarchical Cluster, Heat maps • Sample size often insufficient • Two-sample tests, Discriminant Analysis, “machine learning” approaches to find prognostic factors • Designed studies: Analysis plan should follow design • T-tests, one-way ANOVA to select significantly changing genes • Blocking to account for experimental batch • Two-way ANOVA for complete two-factor experiments • Regression (etc.) for time-course experiments • Corrections for multiple-comparison (20,000 genes tested) • False Discovery Rate • Interpretation of gene lists (open-ended problem!)
Under the null hypothesis, p-values should be uniformly distributed • Note excess of small p-values in 45,000 probe sets • Indicates presence of significant, differentially expressed genes [Figure: histogram of p-values, cut at p<.05, with regions labeled true discoveries and false discoveries]
False Discovery Rate calculation (simplified version) Example: 48 genes detected at p<.001 in chip with 12,000 genes.
$$\mathrm{FDR}^{*} = \frac{\text{Expected Number of False Discoveries}}{\text{Number Discovered}} = \frac{(\text{Number of tests}) \times (\text{p-value cutoff})}{\text{Number Discovered at this p-value}}$$
$$\mathrm{FDR} = \frac{12{,}000 \times 0.001}{48} = \frac{12}{48} = 25\%$$
*Benjamini, Y., Hochberg, Y. (1995) JRSS-B, 57, 289-300.
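The simplified calculation above can be sketched in a few lines of Python. The function name `simple_fdr` and its parameter names are illustrative assumptions, not part of the original slides; the sketch only reproduces the simplified ratio, not the full Benjamini-Hochberg procedure.

```python
def simple_fdr(n_tests, p_cutoff, n_discovered):
    """Simplified FDR: expected false discoveries divided by discoveries.

    Assumes that, if no gene were truly changing, p-values would be
    uniform, so about n_tests * p_cutoff tests fall below the cutoff
    by chance alone.
    """
    if n_discovered <= 0:
        raise ValueError("n_discovered must be positive")
    expected_false = n_tests * p_cutoff
    # The ratio can exceed 1 when few genes are discovered; cap it at 1.
    return min(1.0, expected_false / n_discovered)


# Slide example: 48 genes detected at p < .001 on a chip with 12,000 genes.
fdr = simple_fdr(n_tests=12000, p_cutoff=0.001, n_discovered=48)
print(f"Estimated FDR: {fdr:.0%}")
```

With the slide's numbers, about 12 of the 48 detected genes would be expected to be false discoveries.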
False Discovery Rate calculation (full version) Now we have a guarantee that,
Gene Expression Data Matrix, X (transpose of “Final File” format) • Expression Matrix, X: Samples 1 to n (rows) by Genes 1 to 12,625 (columns) • Annotations for each Gene • Information about each Sample
Analyzing the Data Matrix • "pre-condition" the Expression Data Matrix • Select "significant" Genes (False Discovery Rate) • Select relevant Samples (Outlier rejection, QC) • Re-order, partition the Genes ("clustering") • Re-order the Samples • Visualize the matrix ("heat-map", PCA scatterplot), encode Gene and Sample annotations • Visualize by Sample (rows of X, scatterplots, line plots) • Visualize by Gene (cols of X) • Visualize the Annotations (how?) • Browse the display for new hypotheses!
Principal Component Analysis Each Principal Component is an orthogonal, linear combination of the expression levels. For the ith gene chip: $PC_{ik} = \sum_{j} x_{ij}\, a_{jk}$. In matrix notation: $PC = XA$, where PC is the Principal Components Matrix, A is the Patterns Matrix, and X is the Expression Data Matrix.
Or, Data can be Reconstructed from PCs! A was chosen so that $AA^T$ is the Identity matrix: $X = PC \cdot A^T$
Data Matrix (X) equals Principal Components (PC) times Expression Patterns (EP = $A^T$): X (n Experiments × 12,625 Genes) = PC (n Experiments × n Components) * EP (n Components × 12,625 Genes) • Plot PC(i,1) vs PC(i,2) for each experiment • EP row 1 contains most important “expression pattern” • PC col 1 defines how that pattern is manifest in each experiment • Similarly for EP row 2, PC col 2, etc. • Only a few patterns needed to reconstruct data matrix X
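The decomposition X = PC * EP can be illustrated with a small NumPy sketch based on the singular value decomposition. The toy matrix sizes, the random data, and the choice to center each gene before decomposing are assumptions made for illustration; the slides do not state how the data were centered or which software was used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chips, n_genes = 6, 50                 # toy sizes, not a 12,625-gene chip
X = rng.normal(size=(n_chips, n_genes))  # rows = experiments, cols = genes

# Assumption: center each gene (column) before decomposing.
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
PC = U * s        # principal components, one row per experiment
EP = Vt           # expression patterns, one row per component
A = EP.T          # patterns matrix, so that PC = Xc @ A

# Fraction of total variation explained by each component.
explained = s**2 / np.sum(s**2)

# Full reconstruction, and an approximation from the first two patterns.
X_full = PC @ EP
k = 2
X_approx = PC[:, :k] @ EP[:k, :]

print("variance explained by PC1, PC2:", explained[:2])
print("max reconstruction error (all components):",
      np.abs(X_full - Xc).max())
```

Plotting `PC[:, 0]` against `PC[:, 1]` gives the per-experiment scatterplot described above. In this sketch the rows of `EP` are orthonormal, which is what allows the data to be rebuilt from the components.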
Principal Components Analysis [Scatterplot axes: PC 1 (38%), PC 2 (12%)]
GLOBAL DATABASE (HG U95A) PCA BI-PLOT • Each spot is one chip • N=469
PCA HEATMAP Data (X) equals Components (PC) times Expression Patterns (EP): X (n Experiments × 12,625 Genes) = PC (n Experiments × n Components) * EP (n Components × 12,625 Genes) • Visualize coefficients of a first few “Patterns” • Re-order Experiments
U95A Database PCA Heatmap colored by Sample Type (12) Conclusion: Sample Type and Project determine clusters
PCA Heatmap of Entire Database 469 Chips, 468 Components, 5,933,750 values!
Chip-to-chip normalization, Data transformation • Signal intensity varies chip-to-chip for a variety of technical reasons. • Scale adjustments can be made in a variety of ways. • Median adjustment (divide by col median) is commonly used • Other quantiles (e.g. 75th percentile) may work better • Log-transform • spreads data more evenly • makes variance more uniform • “Lmed” is the median-normalized, log-transformed signal
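A minimal sketch of the “Lmed” idea described above: divide each chip's signals by that chip's median, then take the logarithm. The function name `lmed`, the use of base-10 logs, and the choice to return NaN for non-positive signals are assumptions for illustration, not the toolbox's actual implementation.

```python
import math
from statistics import median


def lmed(chip_signals):
    """Median-normalize one chip's signals, then log10-transform them.

    chip_signals: list of signal values for a single chip.
    Non-positive signals cannot be log-transformed and are returned
    as NaN (an assumption of this sketch).
    """
    med = median(chip_signals)
    if med <= 0:
        raise ValueError("chip median must be positive")
    return [
        math.log10(v / med) if v > 0 else float("nan")
        for v in chip_signals
    ]


# Toy chip: the median signal maps to 0 on the Lmed scale.
print(lmed([10.0, 100.0, 1000.0]))
```

Swapping `median` for another quantile (for example the 75th percentile) would give the alternative scale adjustment mentioned in the bullets.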
Chip-to-chip normalization, Data transformation (2) • Quantile normalization (“ranking” the data): every percentile becomes identical across chips • Quantile normalization may remove technical artifacts (e.g. curvature) • Variance should be homogeneous across measurement scale • Variance may be “homogenized” with appropriate transform (e.g. logarithm, square-root, arcsinh) • “S10” transform -- optimal variance stabilizing, quantile normalizing transform, calibrated to match Log10 over central part of measurement scale
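Quantile normalization as described above can be sketched in plain Python. This version uses the mean of the sorted values across chips as the common reference distribution, which is one common choice and an assumption here; the slides do not say which reference distribution was used. The function name `quantile_normalize` is hypothetical, and ties are broken by original order.

```python
def quantile_normalize(chips):
    """Force every chip to share the same distribution of values.

    chips: list of equal-length lists, one list of signals per chip.
    Returns a new list of lists with the same shape.
    """
    n = len(chips[0])
    if any(len(c) != n for c in chips):
        raise ValueError("all chips must have the same number of genes")

    # Reference distribution: mean of the k-th smallest value across chips.
    sorted_chips = [sorted(c) for c in chips]
    reference = [
        sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)
    ]

    normalized = []
    for c in chips:
        # Gene indices ordered from smallest to largest signal on this chip.
        order = sorted(range(n), key=lambda i: c[i])
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]])
print(result)
```

After normalization each chip contains exactly the same set of values, so every percentile is identical across chips while the rank order of genes within a chip is preserved.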
Comparison of two chips - Lmed(SG) • Note deviation from line of identity
Comparison of two chips - 2 x limits • Note deviation from line of identity • Note nonuniform variance
Median-normalized Log-transform “Lmed” • Adequate in most cases BUT…. • Some nonlinearity may remain, requiring further normalization • Variance is not truly constant, expands at low intensities • Cannot treat zero or negative values • Logarithm may not be best transformation • Median normalization may not always be adequate
Variance Stabilizing Transform (3) Symmetric Adaptive Transform (S10): • We start with quantile normalization to a convenient distribution • We further transform to make variance constant with mean • We adapt transform to empirical variance model (requires an experiment with at least 5 to 10 chips) • We scale transform to match log10 units at midrange • We require symmetry around origin
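The S10 transform itself is not specified in these slides, so it is not reproduced here. As a stand-in, the sketch below uses a scaled arcsinh (one of the transforms named earlier) to illustrate three of the listed properties: it accepts zero and negative values, it is symmetric around the origin, and it approaches log10 for large values. The function name `asinh10` is hypothetical, and the adaptive, data-driven calibration of S10 is not attempted.

```python
import math


def asinh10(x):
    """Scaled arcsinh: symmetric about 0 and close to log10(x) for large x.

    asinh(x / 2) = ln(x / 2 + sqrt(x**2 / 4 + 1)), which tends to ln(x)
    as x grows, so dividing by ln(10) matches log10 at high intensity.
    This is NOT the S10 transform, only an illustration of its stated goals.
    """
    return math.asinh(x / 2.0) / math.log(10.0)


for value in (-1000.0, -1.0, 0.0, 1.0, 1000.0, 10000.0):
    print(value, round(asinh10(value), 4))
```

Unlike the plain logarithm, this function is defined at zero and for negative inputs, which is the limitation of Lmed noted on the previous slide.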
Comparison of two chips - Lmed(Signal) • Model the nonlinear relationship • Red line is plot of quantile of chip 1 vs quantile of chip 2
Comparison of two chips - Quantile normalization • Second chip is quantile-normalized to first chip • Curvature is cured! • Now, can we remove the variable spread? • Nonuniform variance?
Comparison of two chips - Symmetric Adaptive Transform, base 10 “S10” • Uses Quantile normalization • Gives better fit to line of identity • Adapts scale to give homogeneous variance • Uniform scatter about line • Calibrated to match Log10 in middle of scale • *Munson, P.J. A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance-stabilizing transformations. In GeneLogic Workshop of Low Level Analysis of Affymetrix GeneChip Data. 2001. Bethesda, MD.
Symmetric Adaptive Transform (“S10”) [Figure panels: Lmed, S10]
PCA on Lmed transformed data • 12 Chips • 3 Groups • Two apparent outliers • Groups not well separated • 1st PC explains 15.3% of variation
PCA on S10 transformed data • Outliers no longer obvious • Groups well-separated • 1st PC explains 30.8% of variation
Fold Change due to Drug - Log10 scale [Scatterplot: Log Fold Change (LFC), Drug vs. Control, Repl. 1 vs. Repl. 2]
Fold Change due to Drug - S10 scale [Scatterplot: SFC, Drug vs. Control, Repl. 1 vs. Repl. 2]
Log of “Signal”, Variance Model [Figure axes: Lmed Transform Value, Signal Value, Std Dev Lmed, Mean Lmed Value]
S10(“Signal”), Variance Model [Figure axes: S10 Transform Value, Lmed Transform Value, Std Dev S10, Mean S10 Value]
“Probe Level” analysis Comparison of Signal, RMA, S10