Summarization of Oligonucleotide Expression Arrays BIOS 691-803 Winter 2010
What is Summarization? • Some expression arrays (Affymetrix, Nimblegen) use multiple probes to target a single transcript – a ‘probe set’ • Typically probes have different fold changes between any two samples • How to effectively summarize the information in a probe set?
Many Probes for One Gene • How to combine signals from multiple probes into a single gene abundance estimate? [Figure: gene sequence drawn 5´ to 3´, tiled by multiple oligo probes, each a Perfect Match / Mismatch pair]
Probe Variation • Individual probes don’t agree on fold changes • Probe signals vary by two orders of magnitude on each chip • GC content is the most important factor in signal strength [Figure: signal from 16 probes along one gene on one chip]
Probe Measure Variation • Signals from typical probes differ by two orders of magnitude! • GC content is the most important factor • RNA target folding also affects hybridization [Figure: axis running from 0 to 3×10^4]
Bioinformatics Issues • Probes may not map accurately • SNPs in probes • Affymetrix places most probes in the 3’ UTR of genes • Alternate poly-A sites mean that some probe targets may really be less common than others
Probe Mapping • Early builds of the genome often confused regions or genes and their complements • The probe sets shown at right target an rRNA gene and its complement
Alternate Poly-Adenylation Sites • Poly-A marks the mRNA ‘tail’ • Many genes have alternative poly-A sites • The 3’ UTR may be longer or shorter
Many Approaches to Summarization • Affymetrix Microarray Suite (MAS); PLIER • dChip - Li and Wong, HSPH • Bioconductor: • RMA - Bolstad, Irizarry, Speed, et al • affyPLM - Bolstad • gcRMA - Wu • Physical chemistry models - Zhang et al • Factor model • Probe-weighting
Critique of Averaging (MAS5) • Not clear what an average of different probes should mean • Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here • No ‘learning’ based on cross-chip performance of individual probes
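To make the instability point concrete, here is a minimal sketch of a one-step Tukey biweight location estimate of the kind MAS5 applies to probe values. The function name, the toy values, and the tuning constants c = 5 and eps = 0.0001 are assumptions for illustration and are not taken from the slides.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (c and eps are assumed constants)."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)
        # Points far from the median (|u| >= 1) get zero weight.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical log-scale probe values clustered at two ends.
low = [0.9, 1.0, 1.1]
high = [9.9, 10.0, 10.1]

balanced = tukey_biweight(low + high)        # clusters of equal size
tilted = tukey_biweight(low + [1.0] + high)  # one extra value in the low cluster
print(balanced, tilted)
```

With equal-sized clusters the estimate sits midway between them. A single extra value in the low cluster moves the median and shrinks the MAD, so the high cluster gets zero weight and the estimate jumps to the low cluster.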
Motivation for Multi-Chip Models • Probe-level data from a spike-in study (log scale) • Note the parallel trend of all probes • Courtesy of Terry Speed
Model for Probe Signal • Each probe signal is proportional to: i) the amount of target sample, a; ii) the affinity of the specific probe sequence to the target, f • NB: High affinity is not the same as specificity • A probe can give high signal to its intended target and also to other transcripts [Diagram: chips 1 and 2 with amounts a1, a2; probes 1, 2, 3 with affinities f1, f2, f3]
Multiplicative Model • For each gene, a set of probes p1,…,pk • Each probe pj binds the gene with efficiency fj • In each sample i there is an amount ai of the transcript • Probe intensity should be proportional to fj × ai • Always some noise!
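The multiplicative model can be sketched as a small simulation. This is only an illustration: the amounts, affinities, the log-normal noise model, and all names are assumptions, not values from the slides. It shows that on the log2 scale every probe reports roughly the same fold change between samples, even though raw intensities differ widely.

```python
import math
import random

random.seed(0)

# Hypothetical values for one probe set: 3 samples (amounts a_i), 4 probes (affinities f_j).
amounts = [100.0, 400.0, 1600.0]
affinities = [0.2, 1.0, 5.0, 20.0]

def simulate_intensity(a, f, noise_sd=0.1):
    """Intensity = f * a times multiplicative noise (the noise model is an assumption)."""
    return f * a * math.exp(random.gauss(0.0, noise_sd))

# intensities[i][j]: sample i, probe j
intensities = [[simulate_intensity(a, f) for f in affinities] for a in amounts]

# On the log2 scale the product becomes a sum: log2(f * a) = log2(f) + log2(a) + noise.
log_intensities = [[math.log2(y) for y in row] for row in intensities]

# Log fold change between sample 1 and sample 0, per probe; the true value is log2(400/100) = 2.
log_fold_changes = [log_intensities[1][j] - log_intensities[0][j]
                    for j in range(len(affinities))]
print(log_fold_changes)
```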
Robust Linear Models • Criterion of fit: • Least median of squares • Sum of weighted squares • Least squares after discarding outliers • Method for finding the fit: • High-dimensional search • Iteratively re-weighted least squares • Median polish
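As a minimal sketch of iteratively re-weighted least squares, the simplest robust linear model (an intercept only) can be fitted as below. The Huber weight function, the tuning constant k = 1.345, the MAD-based scale, and the toy data are assumptions chosen for illustration.

```python
import statistics

def irls_location(values, k=1.345, n_iter=20):
    """Fit a single location parameter by IRLS with Huber weights (assumed weight function)."""
    mu = statistics.median(values)  # robust starting value
    for _ in range(n_iter):
        residuals = [v - mu for v in values]
        mad = statistics.median([abs(r) for r in residuals])
        scale = max(mad / 0.6745, 1e-8)  # MAD rescaled to estimate the SD; guard against zero
        # Full weight for small residuals, down-weight large ones.
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in residuals]
        # Weighted least-squares update of the location.
        mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return mu

# Hypothetical probe signals with one outlier.
signals = [10.0, 10.2, 9.8, 10.1, 9.9, 50.0]
plain_mean = sum(signals) / len(signals)
robust_fit = irls_location(signals)
print(plain_mean, robust_fit)
```

The plain mean is pulled toward the outlier, while the re-weighted fit stays near the bulk of the data.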
Bolstad, Irizarry, Speed (RMA) • For each probe set, take the log of $PM_{ij} = a_i f_j$ • Then fit the model $\log \widehat{PM}_{ij} = \log a_i + \log f_j + \varepsilon_{ij}$, where the caret represents “after pre-processing” • Fit this additive model by iteratively re-weighted least squares or median polish • Critique: the model assumes probe noise is constant (homoscedastic) on the log scale
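Median polish for the additive log-scale model can be sketched as follows. This is a generic Tukey median polish written for illustration, not the RMA implementation; the table layout (rows are chips, columns are probes), the function name, and the toy data are assumptions.

```python
import statistics

def median_polish(table, n_iter=10, tol=1e-6):
    """Decompose table[i][j] into overall + row_eff[i] + col_eff[j] + residual."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians into the row (chip) effects.
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = statistics.median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians into the column (probe) effects.
        col_med = [statistics.median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = statistics.median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical log2 data: chip effects 8, 10, 12 plus probe effects -1, 0, 1, 3.
chip_level = [8.0, 10.0, 12.0]
probe_level = [-1.0, 0.0, 1.0, 3.0]
log_pm = [[c + p for p in probe_level] for c in chip_level]
log_pm[0][3] += 5.0  # one outlying probe value on the first chip

overall, row_eff, col_eff, resid = median_polish(log_pm)
expression = [overall + r for r in row_eff]  # one summary value per chip
print(expression, resid[0][3])
```

The chip-to-chip differences in the summaries are unaffected by the outlying value, which ends up in the residuals instead.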
Comparing Measures • Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA • 20 replicate arrays, so variance should be small • Standard deviations of expression estimates on the arrays, with genes arranged in four groups by increasing mean expression level • Courtesy of Terry Speed
Background • 25-mers are prone to cross-hybridization • MM > PM for about 1/3 of all probes • Cross-hybridization varies with GC content • Signal intensity varies with cross-hybridization
The gcRMA Approach • Estimate non-specific binding using either: • True null assay (non-homologous RNA) • Estimates from MM • Subtract background before normalization and fitting the model
Evaluating gcRMA • On AffyComp data sets, gcRMA wins • Replicates with 14 spike-ins done by Affymetrix • Many investigators get poor results (and don’t write them up) • gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes • Gharaibeh et al., BMC Bioinformatics 2008, 9:452
Factor Model • Assume a relation between p observations x and the true value z: x = λz + ε, where the εi are independent • Use factor analytic methods to estimate λ • Depends on assuming z ~ Normal • Differs from RMA in relaxing the assumption of IID errors: some probes can have more random error than others
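A one-factor model can be illustrated with a simple method-of-moments estimate. This is a swapped-in simplification, not the factor analytic procedure the slide refers to: it uses the fact that, with z ~ N(0, 1) and independent errors, the covariance between probes j and k equals the product of their loadings. The loadings, noise levels, and names below are hypothetical, and positive loadings are assumed.

```python
import random

random.seed(1)

# Hypothetical loadings and per-probe noise SDs for a 3-probe set.
true_loadings = [1.0, 0.8, 0.5]
noise_sd = [0.2, 0.5, 1.0]

# Simulate x_j = lambda_j * z + e_j with z ~ N(0, 1) and independent e_j.
samples = []
for _ in range(20000):
    z = random.gauss(0.0, 1.0)
    samples.append([l * z + random.gauss(0.0, s)
                    for l, s in zip(true_loadings, noise_sd)])

def covariance(j, k):
    """Sample covariance between probes j and k across the simulated arrays."""
    n = len(samples)
    mean_j = sum(row[j] for row in samples) / n
    mean_k = sum(row[k] for row in samples) / n
    return sum((row[j] - mean_j) * (row[k] - mean_k) for row in samples) / (n - 1)

c01, c02, c12 = covariance(0, 1), covariance(0, 2), covariance(1, 2)

# Off-diagonal covariances equal lambda_j * lambda_k, so each loading follows from a triad.
loadings = [
    (c01 * c02 / c12) ** 0.5,
    (c01 * c12 / c02) ** 0.5,
    (c02 * c12 / c01) ** 0.5,
]
# Probe-specific error variance: var(x_j) - lambda_j^2.
uniqueness = [covariance(j, j) - loadings[j] ** 2 for j in range(3)]
print(loadings, uniqueness)
```

The per-probe error variances come out unequal, which is the sense in which the factor model relaxes the IID error assumption.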
Weighting Probes • It is clear that some probes are more reliable than others • How to assess this in a simple fashion? • If a gene really changes across arrays, then a responsive probe will change more than a noisy probe • Weight by relative ranges • Best performance on AffyComp!
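Weighting by relative ranges can be sketched very simply. This is an illustration of the idea on the slide, not the published weighting method: the function name, the normalization of weights to sum to one, and the toy values are assumptions.

```python
def range_weighted_summary(log_table):
    """log_table[i][j] is the log2 intensity of probe j on array i."""
    n_probes = len(log_table[0])
    # Range of each probe across arrays: responsive probes have larger ranges.
    ranges = [max(row[j] for row in log_table) - min(row[j] for row in log_table)
              for j in range(n_probes)]
    total = sum(ranges)
    if total > 0:
        weights = [r / total for r in ranges]
    else:
        weights = [1.0 / n_probes] * n_probes  # fall back to equal weights
    # One weighted-average summary per array.
    summaries = [sum(w * x for w, x in zip(weights, row)) for row in log_table]
    return weights, summaries

# Hypothetical probe set: probe 0 responds strongly, probe 1 barely changes, probe 2 is in between.
log_table = [
    [6.0, 7.0, 8.0],
    [8.0, 7.1, 9.0],
    [10.0, 6.9, 10.0],
]
weights, summaries = range_weighted_summary(log_table)
print(weights, summaries)
```

The nearly flat probe contributes almost nothing to the summary, while the most responsive probe dominates it.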
Summary and Evaluation • No one best solution for all situations • gcRMA and DFW (Distribution Free Weighted, a probe-weighting method) seem to do very well on AffyComp data • May need weights for DFW by tissue • Leading methods seem to rely on probe weighting