Detailed workshop report on low-level analysis of Affymetrix GeneChip data, covering image analysis, background adjustment, expression indices, normalization, and quality assessment. Includes insights on short oligonucleotide arrays, RNA preparation, image gridding, and iterative algorithms.
Report on Workshop on low level analysis of Affymetrix GeneChip data (Bethesda, Nov 19 2001, sponsor GLGC) Elisabetta Manduchi lab meeting November 29, 2001
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment
Short Oligonucleotide Arrays: a review See paper by Lockhart et al. (1996). • Preparing the array • covalently attached oligonucleotides chemically synthesized directly on a solid substrate • for each mRNA being monitored, a collection (probe set) of probe pairs (16 to 20) is synthesized on the array • each probe pair consists of two probe cells: one containing (millions of) copies of a given 25-mer that is a perfect match (PM) to a subsequence of the mRNA in question, and the other containing copies of a companion mismatch (MM) 25-mer that has a single base difference in a central position.
Short Oligonucleotide Arrays: a review (cont.) • Preparing the RNA source • polyA RNA is converted to cDNA • cDNA is transcribed in vitro in the presence of labeled (biotin or fluorescein) ribonucleotides, giving rise to labeled RNA • RNA is then fragmented with heat (average fragment size of 50 to 100 bases). • Hybridization occurs in a flow cell. A brief washing step follows to remove unhybridized RNA.
[Figure: PM and MM probe cells; one probe set = one transcript]
Image Analysis: gridding • Only one presentation on this: Harry Zuzan (GLGC, Duke University) • Grayscale image plots of the CVs of probe cells (after the Affy MAS 4.0 gridding) revealed a pattern • This motivated his attempt to improve the grid alignment, that is, to estimate the locations of probe cells and the pixels belonging to each cell
Iterative algorithm: • Estimates locations of probe cell centers • Scanned probe cell locations are modelled as a continuous deformation of a lattice • Improves estimates by balancing • minimizing the variance of pixel intensity near probe cell centers • maintaining local lattice structure of probe cell locations • Results: • Considerable improvement with respect to the MAS 4.0 software • Apparently the new MAS algorithm performs better than the earlier one in this respect
Background Adjustment • Typically done at the probe cell level • Probe cell intensities, as output in the .CEL files, are computed by taking the 3rd quartile (75th percentile) of the pixel intensity distribution in that cell, after excluding bordering pixels • Background is then subtracted before computing expression indices (=probe set values) • Background calculation methods: • MAS 4.0 • new MAS • F. Naef et al. (Rockefeller) • R. Irizarry (JHU, Speed’s group)
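The probe cell summary described above can be sketched as follows. This is a minimal illustration, not the MAS implementation: the function name `cell_intensity`, the one-pixel border width, and the "inclusive" quantile interpolation are all assumptions, since the slide only states "3rd quartile after excluding bordering pixels".

```python
from statistics import quantiles

def cell_intensity(pixels, border=1):
    """Return the 75th percentile of the interior pixels of one probe cell.

    pixels: rectangular list of rows of pixel intensities.
    border: number of pixel rows/columns dropped on every side
            (assumed value; the slide does not give the width).
    """
    interior_rows = pixels[border:len(pixels) - border]
    values = [v for row in interior_rows
              for v in row[border:len(row) - border]]
    if len(values) < 2:
        raise ValueError("cell too small after removing the border")
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(values, n=4, method="inclusive")[2]
```

For example, a 4 x 4 cell with a one-pixel border leaves a 2 x 2 interior, and the summary is the upper quartile of those four pixels.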
Background adjustments: MAS 4.0 • Divide the array into sectors (16 by default). • Calculate, for each sector, the average of the lowest 2% of probe cell intensities. This is the sector’s background. • Subtract the sector’s background from each probe cell intensity in that sector.
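A minimal sketch of the three steps above, assuming the 16 sectors form a 4 x 4 grid of equal-sized rectangles (the slide only gives the count). The function name `sector_background_subtract` and its parameters are hypothetical.

```python
def sector_background_subtract(cells, n_side=4, fraction=0.02):
    """Subtract a per-sector background from a 2-D grid of cell intensities.

    cells: list of rows of probe cell intensities.
    n_side: sectors per axis (4 x 4 = 16 sectors, an assumed layout).
    fraction: share of lowest intensities averaged per sector (2%).
    Returns (adjusted grid, dict mapping sector -> background).
    """
    n_rows, n_cols = len(cells), len(cells[0])

    def sector_of(index, size):
        # Map a row or column index to its sector index along that axis.
        return index * n_side // size

    # Group intensities by sector.
    by_sector = {}
    for r, row in enumerate(cells):
        for c, value in enumerate(row):
            key = (sector_of(r, n_rows), sector_of(c, n_cols))
            by_sector.setdefault(key, []).append(value)

    # Sector background = mean of the lowest `fraction` of its cells.
    background = {}
    for key, values in by_sector.items():
        ordered = sorted(values)
        k = max(1, int(len(ordered) * fraction))
        background[key] = sum(ordered[:k]) / k

    adjusted = [
        [value - background[(sector_of(r, n_rows), sector_of(c, n_cols))]
         for c, value in enumerate(row)]
        for r, row in enumerate(cells)
    ]
    return adjusted, background
```

Note that a constant background per sector produces a step at sector boundaries, which is consistent with the new MAS reportedly moving away from a per-sector constant.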
Background adjustments: new MAS • Do not have the details, but it looks like background is no longer constant on a sector • Apparently, for each cell, its distance from each sector center is computed and used to weight that sector’s background contribution to the cell’s background
Background adjustments: Naef et al. • Idea: bkg. is insensitive to MM and visible at low intensity • Select probes (probe cells?) such that $|PM - MM| < \epsilon$ (locally?) • Use $\epsilon = 50$ (new) or 100 (old settings) • Estimate bkg. from the PM (?) distribution of these • Also gives a “trick” for dealing with negative values after background subtraction
Naef et al.: other issues explored/raised • Probe set statistics: • In data from over 100 chips, he observed that roughly 30% of the probe pairs have MM>PM • These are not concentrated at low intensities: 27% of probe pairs with MM>PM are in the top quartile • MMs do not consistently behave as expected • What about not using them? • The probe cell intensities within a probe set vary over decades • This makes it difficult to estimate probe set intensities using “averages” (MAS 4.0)
Background adjustments: Irizarry • Computes a global background by estimating the mode of the MM distribution • From an exploratory study he found that using global background improves on use of probe-specific MM • However they are working on a better background measure, more carefully designed and tested
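One way to estimate the mode of the MM distribution is to take the peak of a kernel density estimate. The sketch below is only an illustration under assumptions: the slide does not say how the mode was estimated, and the Gaussian kernel, the Silverman-style bandwidth, the evaluation grid, and the names `kde_mode` and `subtract_global_background` are all choices made here.

```python
import math
from statistics import quantiles, stdev

def kde_mode(values, grid_size=200):
    """Approximate the mode of `values` as the peak of a Gaussian KDE."""
    n = len(values)
    q1, _, q3 = quantiles(values, n=4)
    # Silverman-style rule of thumb (an assumption, not from the talk).
    spreads = [s for s in (stdev(values), (q3 - q1) / 1.34) if s > 0]
    if not spreads:
        return values[0]  # all values identical
    bandwidth = 0.9 * min(spreads) * n ** (-0.2)

    lo, hi = min(values), max(values)
    step = (hi - lo) / (grid_size - 1)
    best_x, best_density = lo, -1.0
    for k in range(grid_size):
        x = lo + k * step
        # Unnormalized density is enough to locate the peak.
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                      for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x

def subtract_global_background(pm, mm):
    """Subtract one global background (mode of MM) from every PM value."""
    background = kde_mode(mm)
    return [p - background for p in pm], background
```

Because the mode ignores the long right tail of MM values that carry real signal, it behaves more like a noise floor than a mean or median would.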
Expression Indices • The issue: how to derive, from the collection of PM and MM measurements in a probe set, a unique value representing that probe set (=transcript) intensity • Methods: • MAS 4.0 • New MAS • Li-Wong (LW) model based approach • SAFER (D. Holder et al., Merck) • Variance-stabilizing transformations (P. Munson, NIH) • Irizarry and Speed
Expression Indices: MAS 4.0 • Average Difference (AD) method: $AD = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$, with $A$ the subset of probes for which $d_j = PM_j - MM_j$ is within 3 SDs of the average of $d_{(2)}, \ldots, d_{(J-1)}$, where $d_{(j)}$ is the $j$-th smallest difference. This is called the Super-Olympic-Scoring (SOS) method. • A present/absent/marginal call is also attached to every probe set, based on a decision matrix which considers, among other things, the average log(PM/MM).
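A minimal sketch of the Average Difference with Super-Olympic-Scoring, as described above. Two details are assumptions: the SD is taken as the sample SD of the trimmed differences $d_{(2)}, \ldots, d_{(J-1)}$, and the 3-SD bound is inclusive. The function name `average_difference` is hypothetical.

```python
from statistics import mean, stdev

def average_difference(pm, mm, n_sd=3.0):
    """Average Difference over the probe pairs that survive SOS trimming.

    pm, mm: equal-length lists of PM and MM intensities for one probe set.
    """
    d = [p - m for p, m in zip(pm, mm)]
    if len(d) < 4:
        raise ValueError("need at least 4 probe pairs")
    # Drop the smallest and largest difference: d_(2), ..., d_(J-1).
    trimmed = sorted(d)[1:-1]
    center, spread = mean(trimmed), stdev(trimmed)
    # A = probe pairs whose difference lies within n_sd SDs of the center.
    kept = [x for x in d if abs(x - center) <= n_sd * spread]
    return mean(kept)
```

With differences 10, 12, 11, 9, 10, 11, 500 and -300, the two extreme pairs fall outside the 3-SD band and the AD is the mean of the remaining six.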
Expression Indices: new MAS (MAS 5.0) • Presented by Earl Hubbell (Affy) at the workshop and by a marketer here at Penn the following day • Motivation was to improve certain areas: • AD method is minimally robust against minority probes • Negative values are impossible for concentration or intensity and indicate that bias is larger than true effect • Incompatible with standard log-transformation • The algorithm was illustrated, but with some vagueness in the details; two papers should appear in January • New algorithm for P/A calls is incorporated
Outline: • Adjust PM for stray signal, where stray estimate = best of two estimates • The probe set intensity (Signal) is given by taking the Tukey biweight of log(PM/stray) • The Tukey biweight smoothly downweights outliers; it is a weighted mean, with weights scaled by the MAD • Stray signal is typically estimated from the MM values, but anomalous MM values are handled by imputation • It appears as: • stray = MM, if physically possible • log(stray) = log(PM) - log(stray proportion), otherwise, where stray proportion = max(SB, positive) (should this be log(stray proportion)?) and SB = Tukey biweight(log(PM) - log(MM))
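The Tukey biweight mentioned above can be sketched as a one-step estimator: center on the median, scale distances by the MAD, and average with bisquare weights. The tuning constant `c=5.0` and the small `eps` guarding against a zero MAD are assumptions here; the talk did not give these details.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a MAD-scaled, outlier-resistant mean.

    c and eps are assumed tuning constants, not taken from the talk.
    """
    m = median(values)
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Bisquare weight: smooth decay to zero at |u| = 1, zero beyond.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Applied to the log ratios of a probe set, a single wild probe gets weight zero while probes near the median keep weights close to one, which is the "smooth downweighting" the slide refers to.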
Expression Indices: Li and Wong model based approach • Full model (LWF): $MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$, $PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon_{ij}$ • Reduced model (LWR): $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$. This is for a given probe set: $i$ runs over the arrays and $j$ over the probes in the set • The MBEI is defined as the ML estimate of $\theta_i$; $\phi_j$ is the sensitivity index of probe $j$
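A minimal sketch of fitting the reduced model by alternating least squares, which coincides with maximum likelihood if the errors are assumed Gaussian with constant variance. The identifiability constraint $\sum_j \phi_j^2 = J$, the starting values, and the fixed iteration count are assumptions made here, and the sketch omits the outlier handling covered under quality assessment. The function name `fit_li_wong_reduced` is hypothetical.

```python
def fit_li_wong_reduced(diff, n_iter=50):
    """Fit PM_ij - MM_ij = theta_i * phi_j by alternating least squares.

    diff[i][j]: PM - MM difference for array i, probe j (one probe set).
    Returns (theta, phi) with sum(phi_j ** 2) == number of probes.
    """
    n_arrays, n_probes = len(diff), len(diff[0])
    phi = [1.0] * n_probes

    def solve_theta(phi):
        # For fixed phi, each theta_i is a one-parameter regression.
        phi_ss = sum(p * p for p in phi)
        return [sum(y * p for y, p in zip(row, phi)) / phi_ss
                for row in diff]

    for _ in range(n_iter):
        theta = solve_theta(phi)
        theta_ss = sum(t * t for t in theta)
        if theta_ss == 0:
            break  # all differences are zero; nothing to fit
        # For fixed theta, each phi_j is a one-parameter regression.
        phi = [sum(diff[i][j] * theta[i] for i in range(n_arrays)) / theta_ss
               for j in range(n_probes)]
        # Rescale so that sum(phi_j^2) = J (assumed constraint).
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]

    theta = solve_theta(phi)
    return theta, phi
```

The fitted `theta[i]` plays the role of the MBEI for array i, and `phi[j]` the sensitivity of probe j.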
Comparisons of expression indices F. Wright et al., Ohio State University • Both theoretical and experimental comparisons • Theoretical comparisons based on the assumption that the Li-Wong model is true • Estimator variances were compared: LWF outperformed LWR, which outperformed AD; LWF outperformed PM-only, which outperformed LWR • Empirical comparisons based on mixture experiments with replicates (6 per condition) to compare LWF, LWR, AD, and LA (Affy Log Ave) • Again model-based estimators seemed superior to simple averaging
Expression Indices: SAFER Scale matters, Additive Fits (probes and chips), Experimental-unit variability, Robustness and resistance • Transform PM-MM data via a linear-log hybrid scale • Fit (median polish) probe-specific model using all chips: $\log^*(PM_{ij} - MM_{ij}) = \mathrm{chip}_i + \mathrm{probe}_j + \mathrm{error}_{ij}$
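The additive chip-plus-probe fit can be sketched with Tukey's median polish. This is a generic implementation, not the SAFER code: it expects values that have already been transformed (the linear-log hybrid $\log^*$ is not specified on the slide, so it is left out), and the fixed iteration count is an assumption.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish.

    Rows are chips and columns are probes of one probe set.
    Returns (overall, row_eff, col_eff, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(map(float, row)) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]

        # Sweep column medians out of the residuals into the column effects.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]

    return overall, row_eff, col_eff, resid
```

The chip-level expression value is then `overall + row_eff[i]`. Because medians are used throughout, a few aberrant probe cells end up in the residuals rather than distorting the chip effects.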
Expression Indices: variance-stabilizing transformations • Generalized Log Transform on AD (GLog(AD)) • Adaptive Transform of AD (TAD): plot Log(SD) vs mean of NQ(AD), fit (splines) a smooth function $g$, and set $T(X) = \int_{-\infty}^{X} \frac{du}{g(u)}$ • Much debate on transforms followed…
Expression Indices: Irizarry and Speed • Summarize expression level of a probe set by Average log2(PM-BG) • PMs need to be normalized • Background computed as described under “Background adjustments: Irizarry” (slide 15) • If PM-BG<0, use the minimum of the positive values divided by 2 • They strongly suggest: • Not to subtract or divide by MM (MMs grow with concentration, showing that they detect signal as well as non-specific binding) • Probe effect is additive on log scale • Take logs, no other transformations • Used Gene Logic spike-in and dilution studies to analyze AD, Li-Wong, and their own measure • All three performed well • AvLog(PM-BG) was arguably the best of the three
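A minimal sketch of the AvLog(PM-BG) summary for one probe set. Two points are assumptions: the "minimum of positives" is taken over the probe set passed in (it could equally be computed over the whole chip), and values exactly equal to zero are treated like negative ones. The function name `avlog_pm_bg` is hypothetical.

```python
import math

def avlog_pm_bg(pm, background):
    """Average log2(PM - BG) over the probes of one probe set."""
    adjusted = [p - background for p in pm]
    positives = [a for a in adjusted if a > 0]
    if not positives:
        raise ValueError("no positive background-adjusted PM value")
    # Non-positive values are replaced by half the smallest positive value.
    floor = min(positives) / 2
    adjusted = [a if a > 0 else floor for a in adjusted]
    return sum(math.log2(a) for a in adjusted) / len(adjusted)
```

For PM values 108, 116, 90, 132 and a background of 100, the adjusted values are 8, 16, 4 (floor) and 32, so the summary is the mean of 3, 4, 2 and 5.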
Normalization • MAS 4.0 • Don’t have details about the new MAS algorithm • A. Hill (Wyeth-Ayerst Research) • Li-Wong • SAFER (D. Holder et al., Merck) • B. Bolstad (Speed’s group) quantile normalization • Magnus Astrand (AstraZeneca)
Normalization:MAS 4.0 • Applied at the probe set (gene expression index) level • The trimmed (2%) mean intensity of all probe sets on the array is scaled to a constant target level • There are obvious situations in which the “constant mean assumption” may not be well supported • Chips monitoring a “small” fraction of transcriptome • Non-random gene selection on arrays (e.g. C. elegans A vs B/C) • …
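The scaling step above can be sketched as follows. It assumes the 2% trim is applied at each end of the sorted probe set values, and the target level of 1500 is an arbitrary placeholder; neither is stated on the slide. The function name `scale_to_target` is hypothetical.

```python
def scale_to_target(values, target=1500.0, trim=0.02):
    """Scale probe set values so their trimmed mean equals `target`.

    trim: fraction dropped from each end before averaging (assumed
          symmetric); target: placeholder level, not from the talk.
    Returns (scaled values, scaling factor).
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    factor = target / (sum(kept) / len(kept))
    return [v * factor for v in values], factor
```

A single multiplicative factor per array is exactly what makes the “constant mean assumption” matter: if most monitored genes change in one direction, the factor absorbs real biology.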
Normalization: A. Hill • A purely spike-based normalization strategy (Frequency) • Add 11 biotin-labeled cRNA spikes to each hybridization cocktail • Construct a calibration curve • Use A/P calls for the spikes to estimate array sensitivity • Dampen AD signals below the sensitivity level to eliminate negative AD values • A hybrid normalization (Scaled Frequency) • Motivation: variation in cRNA “purity”, need to reduce spike skew • Method: • Define a set of arrays and compute ADs for all arrays • Pool spike responses and fit single model to pooled response • Calibrate all arrays with single calibration factor • Compute array sensitivity and dampen frequencies
Normalization: Li-Wong • Applied at the probe level (.DAT or .CEL files) • For a group of arrays • Select common baseline (array with median overall brightness as measured by the median probe cell intensity) • Normalize all arrays (except the baseline) to this baseline • Select a set of “non-differentially expressed” genes computationally (iterative procedure which looks at ranks) • Fit a normalization curve
Normalization: SAFER • Applied at probe set level • Initial centering of chips • Plot chip effects vs overall expression level (grand median) for each probe set: one plot per chip • Omit probe sets that appear to change • (Between group |dev|)/(Within group |dev|) • Omit probe sets in top 25% • Fit a resistant scatterplot smoother (lowess)
Normalization: Bolstad, Astrand • B. Bolstad: quantile normalization • Applied at the probe cell (PM) level • Normalize each chip against all others by making the quantiles of all chips agree • M. Astrand: • A procedure which extends the ideas of Yang et al. for two-color arrays to the situation in which there are no pairs, but each chip needs to be normalized against all others
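A minimal sketch of quantile normalization, in which every chip is mapped onto a common reference distribution. Using the mean of the sorted values across chips as the reference is the usual choice but is an assumption here, as the slide only says the quantiles are made to agree; ties are broken by position for simplicity. The function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(chips):
    """Give every chip the same empirical distribution of intensities.

    chips: list of equal-length lists of PM intensities, one per chip.
    Returns a new list of lists; each value is replaced by the reference
    value at its within-chip rank.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(s[k] for s in sorted_chips) / len(chips)
                 for k in range(n)]

    normalized = []
    for chip in chips:
        # Indices of this chip's values from smallest to largest.
        order = sorted(range(n), key=lambda idx: chip[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After normalization each chip keeps its own ranking of probes, but all chips share identical sorted values.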
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment • Li-Wong • Michael Elashoff (GLGC) • Bill Craven (GLGC)
Quality Assessment: Li-Wong • Iterative procedure: • For a given “probe response pattern” (set of fitted probe sensitivity indices), the (conditional) SE attached to a fitted MBEI can be used to identify “outlier arrays”. These arrays are excluded and the remaining ones are used to estimate the probe response pattern for the given probe set. • For a given set of fitted MBEI (for a probe set across a collection of arrays), the (conditional) SE attached to a fitted probe can be used to identify problematic probes. • Single outliers (image spike in one array affecting just one PM-MM difference) are also identified. • Work with X. Fang on model-based saturation handling: • Borrow info from MMs to impute values of saturated PMs.
Quality Assessment: GLGC • Automated Chip QC (M. Elashoff) • Database of passing and failing chips to serve as the training set (5K passing, 2K failing) • QC for: dimness, high background, unevenness, spots, haze band, scratches, crop circle, grid misalignment, etc. • Implements Li-Wong and • Compares distributions of outlier count for passing and failing chips in training set • Determines upper bound of acceptable outlier count • The Dilution Series (B. Craven) • Two sources of cRNA, A (human liver tissue) and B (CNS cell line), hybridized to human chip (HGU95A) in a range of proportions and dilutions, with replicates • This can be used for method assessment as in Irizarry and Speed