Report on Workshop on low level analysis of Affymetrix GeneChip data (Bethesda, Nov 19 2001, sponsor GLGC). Elisabetta Manduchi lab meeting November 29, 2001. Topics. Image analysis Background adjustment Expression indices Normalization Quality assessment.
Report on Workshop on low level analysis of Affymetrix GeneChip data (Bethesda, Nov 19 2001, sponsor GLGC). Elisabetta Manduchi, lab meeting, November 29, 2001
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment
Short Oligonucleotide Arrays: a review • See paper by Lockhart et al. (1996). • Preparing the array • covalently attached oligonucleotides are chemically synthesized directly on a solid substrate • for each mRNA being monitored, a collection (probe set) of probe pairs (16 to 20) is synthesized on the array • each probe pair consists of two probe cells: one containing (millions of) copies of a given 25-mer that is a perfect match (PM) to a subsequence of the mRNA in question, and the other containing copies of a companion mismatch (MM) 25-mer that has a single base difference in a central position.
Short Oligonucleotide Arrays: a review (cont.) • Preparing the RNA source • polyA RNA is converted to cDNA • cDNA is transcribed in vitro in the presence of labeled (biotin or fluorescein) ribonucleotides, giving rise to labeled RNA • RNA is then fragmented with heat (average fragment size of 50 to 100 bases). • Hybridization occurs in a flow-cell. A brief washing step follows to remove un-hybridized RNA.
[Figure: PM and MM probe cells; one probe set corresponds to one transcript]
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment
Image Analysis: gridding • Only one presentation on this: Harry Zuzan (GLGC, Duke University) • Grayscale image plots of the CVs of probe cells (after the Affy MAS 4.0 gridding) revealed a pattern • This motivated his attempt to improve the grid alignment, that is, to estimate the locations of probe cells and the pixels belonging to each such cell
Iterative algorithm: • Estimates locations of probe cell centers • Scanned probe cell locations are modelled as a continuous deformation of a lattice • Improves estimates by balancing • minimizing the variance of pixel intensity near probe cell centers • maintaining local lattice structure of probe cell locations • Results: • Considerable improvement with respect to the MAS 4.0 software • Apparently the new MAS algorithm performs better than the earlier one in this respect
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment
Background Adjustment • Typically done at the probe cell level • Probe cell intensities, as output in the .CEL files, are computed by taking the 3rd quartile of the pixel intensity distribution in that cell, after excluding bordering pixels • Background is then subtracted before computing expression indices (=probe set values) • Background calculation methods: • MAS 4.0 • new MAS • F. Naef et al. (Rockefeller) • R. Irizarry (JHU, Speed’s group)
Background adjustments: MAS 4.0 • Divide array into sectors (16 by default). • Calculate, for each sector, the average of the lowest 2% probe cell intensities. This is the sector’s background. • Subtract the sector’s background from each probe cell intensity in that sector.
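The three MAS 4.0 steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the array is taken to be a rectangular grid of probe cell intensities, the 16 sectors are assumed to form a 4 by 4 grid, and the function name and parameters are hypothetical rather than anything from the Affymetrix software.

```python
def sector_background_subtract(cells, n_rows=4, n_cols=4, low_fraction=0.02):
    """Subtract a per-sector background from a 2D list of probe cell intensities.

    Assumption: sectors are an n_rows x n_cols grid (4 x 4 = 16 by default).
    """
    height, width = len(cells), len(cells[0])

    # Group cell coordinates by the sector they fall in.
    sectors = {}
    for r in range(height):
        for c in range(width):
            key = (r * n_rows // height, c * n_cols // width)
            sectors.setdefault(key, []).append((r, c))

    adjusted = [row[:] for row in cells]
    for coords in sectors.values():
        values = sorted(cells[r][c] for r, c in coords)
        # Average of the lowest 2% of intensities (at least one cell).
        n_low = max(1, int(len(values) * low_fraction))
        background = sum(values[:n_low]) / n_low
        for r, c in coords:
            adjusted[r][c] = cells[r][c] - background
    return adjusted
```

Because the background is constant within a sector, cells on either side of a sector boundary can receive different corrections, which is the step-like behavior the newer algorithm appears to smooth out.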
Background adjustments: new MAS • Do not have the details, but it looks like background is no longer constant on a sector • Apparently, for each cell, its distance from each sector center is computed and used to weight that sector’s background contribution to the cell’s background
Background adjustments: Naef et al. • Idea: bkg. is insensitive to MM and visible at low intensity • Select probes (probe cells?) such that |PM-MM| < ε (locally?) • use ε=50 (new) or 100 (old settings) • Estimate bkg. from the PM (?) distribution of these • Also gives a “trick” for dealing with negative values after background subtraction
Naef et al.: other issues explored/raised • Probe set statistics: • In data from over 100 chips, roughly 30% of the probe pairs have MM>PM • These are not concentrated at low intensities: 27% of probe pairs with MM>PM are in the top quartile • MMs do not consistently behave as expected • What about not using them? • The probe cell intensities within a probe set vary over decades • this makes it difficult to estimate probe set intensities using “averages” (MAS 4.0)
Background adjustments: Irizarry • Computes a global background by estimating the mode of the MM distribution • From an exploratory study he found that using global background improves on use of probe-specific MM • However they are working on a better background measure, more carefully designed and tested
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment
Expression Indices • The issue: how to derive, from the collection of PM and MM measurements in a probe set, a unique value representing that probe set (=transcript) intensity • Methods: • MAS 4.0 • New MAS • Li-Wong (LW) model based approach • SAFER (D. Holder et al., Merck) • Variance-stabilizing transformations (P. Munson, NIH) • Irizarry and Speed
Expression Indices: MAS 4.0 • Average Difference (AD) method: $AD = \frac{1}{|A|}\sum_{j \in A}(PM_j - MM_j)$, with A the subset of probes for which $d_j = PM_j - MM_j$ is within 3 SDs of the average of $d_{(2)}, \dots, d_{(J-1)}$, where $d_{(j)}$ is the j-th smallest difference. This is called the Super-Olympic-Scoring (SOS) method. • A present/absent/marginal call is also attached to every probe set, based on a decision matrix, which considers, among other things, the average log(PM/MM).
Expression Indices: new MAS (MAS 5.0) • Presented by Earl Hubbell (Affy) at the workshop and by a marketer here at Penn the following day • Motivation was to improve certain areas: • AD method is minimally robust against minority probes • Negative values are impossible for concentration or intensity and indicate that bias is larger than true effect • Negative values are incompatible with the standard log-transformation • The algorithm was illustrated but with some vagueness in details; two papers should appear in January • New algorithm for P/A calls is incorporated
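A minimal sketch of the Average Difference with Super-Olympic-Scoring trimming, as described above. Assumptions: the SD is the sample standard deviation of the trimmed differences d(2), ..., d(J-1), and the function name is hypothetical; MAS 4.0 itself may differ in these details.

```python
import statistics


def average_difference(pm, mm, n_sd=3.0):
    """Average of PM - MM over probes passing the SOS filter."""
    diffs = [p - m for p, m in zip(pm, mm)]
    if len(diffs) < 4:
        raise ValueError("need at least 4 probe pairs to trim and take an SD")

    # Drop the smallest and largest difference before computing mean and SD.
    trimmed = sorted(diffs)[1:-1]
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)  # assumption: sample SD

    # A = probes whose difference lies within n_sd SDs of the trimmed mean.
    kept = [d for d in diffs if abs(d - center) <= n_sd * spread]
    return statistics.mean(kept)
```

With PM values 110, 120, 130, 140, 150, 1000 and MM fixed at 10, the outlying difference of 990 is excluded and the AD is 120. Note that nothing prevents the result from being negative when MM exceeds PM for most probes.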
Outline: • Adjust PM for stray signal, where stray estimate=best of two estimates • The probe set intensity (Signal) is given by taking the Tukey biweight of log(PM/stray) • The Tukey biweight gives a smooth downweighting of outliers; it’s a weighted (by MAD) mean • Stray signal is typically estimated using the MM values, but anomalous MM values are handled with imputation • It appears as: • stray=MM, if physically possible • log(stray)=log(PM)-log(stray proportion), otherwise, where stray proportion=max(SB, positive) (should this be log(stray proportion)?) and SB=Tukey biweight(log(PM)-log(MM))
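To make the "weighted (by MAD) mean" concrete, here is a sketch of a one-step Tukey biweight. The tuning constant c=5 and the small epsilon that guards against a zero MAD are assumptions, not values stated at the workshop, and the function name is hypothetical.

```python
import statistics


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])

    weights = []
    for v in values:
        # Distance from the median in units of c * MAD.
        u = (v - center) / (c * mad + epsilon)
        # Smooth downweighting; points with |u| >= 1 get zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)

    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

For the values 1.0, 1.1, 0.9, 1.0, 8.0 the outlier 8.0 receives zero weight, so the estimate stays at about 1.0 while the plain mean is 2.4. In the outline above, the inputs would be the per-probe values of log(PM/stray).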
Expression Indices: Li and Wong model based approach • Full model (LWF): $MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$, $PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon_{ij}$ • Reduced model (LWR): $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$ • This is for a given probe set: i runs over the arrays and j over the probes in the set • The MBEI is defined as the ML estimate of $\theta_i$; $\phi_j$ is the sensitivity index of probe j
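The reduced model, in which each PM-MM difference is an array-level expression index (theta_i) times a probe sensitivity (phi_j) plus error, can be fitted by alternating least squares, which coincides with ML under Gaussian errors. The sketch below assumes the identifiability constraint that the squared probe sensitivities sum to the number of probes J; it omits the outlier handling of the full method, and the function name is hypothetical.

```python
import math


def fit_li_wong_reduced(diffs, n_iter=50):
    """Fit diffs[i][j] ~ theta[i] * phi[j] for one probe set.

    diffs[i][j] is PM - MM for array i and probe j.
    Returns (theta, phi) with sum(phi_j ** 2) == number of probes.
    """
    n_probes = len(diffs[0])
    phi = [1.0] * n_probes

    def update_theta(phi):
        # Least-squares theta_i for fixed phi.
        phi_ss = sum(p * p for p in phi)
        return [sum(d * p for d, p in zip(row, phi)) / phi_ss for row in diffs]

    for _ in range(n_iter):
        theta = update_theta(phi)
        theta_ss = sum(t * t for t in theta)
        # Least-squares phi_j for fixed theta.
        phi = [sum(row[j] * t for row, t in zip(diffs, theta)) / theta_ss
               for j in range(n_probes)]
        # Rescale so that the squared sensitivities sum to J.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]

    return update_theta(phi), phi
```

The fixed iteration count keeps the sketch short; a convergence check on the change in theta would be the natural refinement.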
Comparisons of expression indices: F. Wright et al., Ohio State University • Both theoretical and experimental comparisons • Theoretical comparisons based on the assumption that the Li-Wong model is true • Estimator variances were compared: LWF outperformed LWR, which outperformed AD; LWF outperformed PM-only, which outperformed LWR • Empirical comparisons based on mixture experiments with replicates (6 per condition) to compare LWF, LWR, AD, and LA (Affy Log Ave) • Again model-based estimators seemed superior to simple averaging
Expression Indices: SAFER • Scale matters, Additive Fits (probes and chips), Experimental-unit variability, Robustness and resistance • Transform PM-MM data via a linear-log hybrid scale • Fit (median polish) probe-specific model using all chips: $\log^*(PM_{ij} - MM_{ij}) = \mathrm{chip}_i + \mathrm{probe}_j + \mathrm{error}_{ij}$
Expression Indices: variance-stabilizing transformations • Generalized Log Transform on AD (GLog(AD)) • Adaptive Transform of AD (TAD): plot Log(SD) vs mean of NQ(AD), fit (splines) smooth function g: $T(X) = \int_{-\infty}^{X} \frac{1}{g(u)}\,du$ • Much debate on transforms followed…
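The additive chip-plus-probe fit can be illustrated with Tukey's median polish. This sketch assumes the input table already holds the transformed PM-MM values (the linear-log hybrid transform is not specified here, so it is left to the caller), with one row per chip and one column per probe. The function name and the fixed number of sweeps are illustrative choices.

```python
import statistics


def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish."""
    n_rows, n_cols = len(table), len(table[0])
    residuals = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(n_iter):
        # Row sweep: move each row's median into the row (chip) effect.
        for i in range(n_rows):
            m = statistics.median(residuals[i])
            row_eff[i] += m
            residuals[i] = [v - m for v in residuals[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]

        # Column sweep: move each column's median into the column (probe) effect.
        for j in range(n_cols):
            m = statistics.median([residuals[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                residuals[i][j] -= m
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]

    return overall, row_eff, col_eff, residuals
```

Here overall plus row_eff[i] plays the role of the chip term and col_eff[j] the probe term. Because only medians are used, a single aberrant probe cell ends up in the residuals instead of distorting the chip effects.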
Expression Indices: Irizarry and Speed • Summarize expression level of a probe set by Average log2(PM-BG) • PMs need to be normalized • Background computed as illustrated in slide 15 • If PM-BG<0 use minimum of positives divided by 2 • They strongly suggest: • Not to subtract or divide by MM (they grow with concentration, showing that they detect signal as well as non-specific binding) • Probe effect is additive on log scale • Take logs, no other transformations • Used Gene Logic spike-in and dilution studies to analyze AD, Li-Wong, and their own measure • All three performed well • AvLog(PM-BG) was arguably the best of the three
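The AvLog(PM-BG) summary is simple enough to write out directly. In this sketch the background is passed in as a number (for example, the estimated mode of the MM distribution), the PM values are assumed to be already normalized, and the "minimum of positives" is taken within the probe set being summarized; that last choice and the function name are assumptions.

```python
import math


def avlog_pm_minus_bg(pm, background):
    """Average log2(PM - BG) over the probes of one probe set."""
    adjusted = [p - background for p in pm]
    positives = [a for a in adjusted if a > 0]
    if not positives:
        raise ValueError("no PM value exceeds the background")

    # Non-positive values are replaced by half the smallest positive value
    # (zero is treated like a negative value, since log2(0) is undefined).
    floor = min(positives) / 2
    adjusted = [a if a > 0 else floor for a in adjusted]

    return sum(math.log2(a) for a in adjusted) / len(adjusted)
```

For PM values 108, 132, 90, 116 and a background of 100, the adjusted values are 8, 32, 4 (imputed), 16, giving an average log2 of 3.5.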
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment
Normalization • MAS 4.0 • Don’t have details about the new MAS algorithm • A. Hill (Wyeth-Ayerst Research) • Li-Wong • SAFER (D. Holder et al., Merck) • B. Bolstad (Speed’s group): quantile normalization • Magnus Astrand (AstraZeneca)
Normalization: MAS 4.0 • Applied at the probe set (gene expression index) level • The trimmed (2%) mean intensity of all probe sets on the array is scaled to a constant target level • There are obvious situations in which the “constant mean assumption” may not be well supported • Chips monitoring a “small” fraction of transcriptome • Non-random gene selection on arrays (e.g. C. elegans A vs B/C) • …
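The scaling step can be sketched as follows. Assumptions: the 2% trimming is applied at each end of the sorted probe set values, and the target level is supplied by the caller; the function name is hypothetical.

```python
def scale_to_target(values, target, trim_fraction=0.02):
    """Scale probe set values so their trimmed mean equals the target.

    Returns (scaled_values, scale_factor).
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values dropped at each end
    core = ordered[k:len(ordered) - k]
    trimmed_mean = sum(core) / len(core)

    factor = target / trimmed_mean
    return [v * factor for v in values], factor
```

A single multiplicative factor per array is what makes the "constant mean assumption" matter: if most monitored transcripts really do change between samples, the factor absorbs part of the biological signal.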
Normalization: A. Hill • A purely spike-based normalization strategy (Frequency) • Add 11 biotin-labeled cRNA spikes to each hybridization cocktail • Construct a calibration curve • Use A/P calls for the spikes to estimate array sensitivity • Dampen AD signals below the sensitivity level to eliminate negative AD values • A hybrid normalization (Scaled Frequency) • Motivation: variation in cRNA “purity”, need to reduce spike skew • Method: • Define a set of arrays and compute ADs for all arrays • Pool spike responses and fit single model to pooled response • Calibrate all arrays with single calibration factor • Compute array sensitivity and dampen frequencies
Normalization: Li-Wong • Applied at the probe level (.DAT or .CEL files) • For a group of arrays • Select common baseline (array with median overall brightness as measured by the median probe cell intensity) • Normalize all arrays (except the baseline) to this baseline • Select a set of “non-differentially expressed” genes computationally (iterative procedure which looks at ranks) • Fit a normalization curve
Normalization: SAFER • Applied at probe set level • Initial centering of chips • Plot chip effects vs overall expression level (grand median) for each probe set: one plot per chip • Omit probe sets that appear to change • (Between group |dev|)/(Within group |dev|) • Omit probe sets in top 25% • Fit a resistant scatterplot smoother (lowess)
Normalization: Bolstad, Astrand • B. Bolstad: quantile normalization • Applied at the probe cell (PM) level • Normalize each chip against all others by making the quantiles of all chips agree • M. Astrand: • A procedure which extends the ideas of Yang et al. for two-color arrays to the situation in which there are no pairs, but each chip needs to be normalized against all others
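Making the quantiles of all chips agree can be sketched as below. The reference distribution is assumed to be the mean of the sorted values across chips at each rank, and ties within a chip are broken by position; both choices, and the function name, are assumptions of this sketch.

```python
def quantile_normalize(chips):
    """Give every chip the same empirical distribution.

    chips: list of equal-length lists of PM intensities, one list per chip.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]

    # Reference quantiles: average across chips at each rank.
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]

    normalized = []
    for chip in chips:
        # Indices of this chip's values in increasing order.
        order = sorted(range(n), key=lambda idx: chip[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

For two chips with values [5, 2, 3] and [4, 1, 6], the reference quantiles are [1.5, 3.5, 5.5], and the normalized chips are [5.5, 1.5, 3.5] and [3.5, 1.5, 5.5]: each probe keeps its rank within its chip, but both chips now share the same set of values.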
Topics • Image analysis • Background adjustment • Expression indices • Normalization • Quality assessment • Li-Wong • Michael Elashoff (GLGC) • Bill Craven (GLGC)
Quality Assessment: Li-Wong • Iterative procedure: • For a given “probe response pattern” (set of fitted probe sensitivity indices), the (conditional) SE attached to a fitted MBEI can be used to identify “outlier arrays”. These arrays are excluded and the remaining ones are used to estimate the probe response pattern for the given probe set. • For a given set of fitted MBEI (for a probe set across a collection of arrays), the (conditional) SE attached to a fitted probe can be used to identify problematic probes. • Single outliers (image spike in one array affecting just one PM-MM difference) are also identified. • Work with X. Fang on model-based saturation handling: • Borrow info from MMs to impute values of saturated PMs.
Quality Assessment: GLGC • Automated Chip QC (M. Elashoff) • Database of passing and failing chips to serve as the training set (5K passing, 2K failing) • QC for: dimness, high background, unevenness, spots, haze band, scratches, crop circle, grid misalignment, etc. • Implements Li-Wong and • Compares distributions of outlier count for passing and failing chips in training set • Determines upper bound of acceptable outlier count • The Dilution Series (B. Craven) • Two sources of cRNA, A (human liver tissue) and B (CNS cell line), hybridized to human chip (HGU95A) in a range of proportions and dilutions, with replicates • This can be used for method assessment as in Irizarry and Speed