AffyDEComp: towards a benchmark for differential expression methods • Richard Pearson • School of Computer Science, University of Manchester
Overview • Why benchmark DE methods? • The Golden Spike data set • AffyDEComp • Conclusions • Recommendations
The need for benchmarks • Microarray analysis has many stages • Competing methods at each stage • Methodologists are good at showing the superiority of their own methods • Results can appear contradictory • Confused end users: choice driven by… • What they are familiar with • What colleagues use • What was used in their favourite paper • …and not by a scientific comparison
Benchmarking requirements • Methods: a set we wish to compare • Benchmark data: where truth is known • Metrics: by which to compare methods • Affycomp • Methods: summarisation methods • Benchmark data: various spike-in studies • Metrics: various, e.g. area under the ROC curve for a fold change classifier • Affycomp doesn’t compare DE methods
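As a rough illustration of the kind of metric listed above, the following Python sketch computes the area under the ROC curve for a simple fold change classifier. The function name `roc_auc`, the log fold changes and the truth labels are all invented for illustration; none of this is Affycomp code or data.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical log2 fold changes for six probesets (made-up numbers).
log_fc = [1.8, 0.9, 0.1, -0.2, 1.1, 0.05]
# True = spiked in as differentially expressed (hypothetical truth).
truth = [True, True, False, False, True, False]

# A two-sided fold change classifier ranks probesets by |log FC|.
auc_fc = roc_auc([abs(x) for x in log_fc], truth)
print(f"AUC for fold change classifier: {auc_fc:.2f}")
```

An AUC of 1 means every true positive is ranked above every true negative; 0.5 corresponds to random ranking.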
A benchmark for DE methods • Methods: • DE methods depend on summarisation • Compare summarisation/DE combinations • Benchmark data: • Affycomp spike-ins have few DE genes • Golden spike data has many DE genes, but also a few “issues”! • Metrics: • Based around areas under ROC curves
The Golden Spike data • 3 “sample”, 3 “control” arrays • Many RNAs “spiked-in” at known levels • “DE”, “Equal” and “Empty” probesets. • Controversial data set • Non-uniform null p-value distributions - use ROC • Spike-in concentrations high - unrepresentative • “DE” spike-ins all up-regulated - unrepresentative • Concentrations and FC confounded - loess • Different FC between “Equal” and “Empty”
“Empty” have higher FC than “Equal” • Most analyses have treated both Empty and Equal as true negatives - to what effect?
“Empty” have higher FC than “Equal” • To illustrate how analysis choices affect results, I’ll treat Empty and Equal as true negatives (TN) and DE<=1.2 as true positives (TP)
2-sided test • Large apparent difference between methods • Can you guess which paper used this chart?
2-sided test • Large apparent difference between methods • Are TP correctly identified as up-regulated?
1-sided test of up-regulation • Probesets identified as up-regulated are generally not TP
1-sided test of down-regulation • DE probesets are mostly identified as down-regulated, even though in truth they are up-regulated • We appear to be identifying TP as down-regulated
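The difference between the 2-sided and 1-sided charts can be sketched in Python as follows. The signed statistics are invented, and `direction_scores` and `roc_auc` are hypothetical helper names; the point is only that a 2-sided ranking can look excellent while the TP are in fact being called in the wrong direction.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def direction_scores(stats, direction):
    """Turn signed test statistics into ranking scores for one test type."""
    if direction == "two-sided":
        return [abs(s) for s in stats]   # large in either direction
    if direction == "up":
        return list(stats)               # large and positive
    if direction == "down":
        return [-s for s in stats]       # large and negative
    raise ValueError(f"unknown direction: {direction!r}")


# Invented signed statistics: the three TP come out strongly negative.
stats = [-3.0, -2.5, -2.0, -0.5, 0.2, 0.4]
is_tp = [True, True, True, False, False, False]

aucs = {d: roc_auc(direction_scores(stats, d), is_tp)
        for d in ("two-sided", "up", "down")}
print(aucs)
```

With these made-up numbers the 2-sided and down-regulation rankings separate TP from TN perfectly, while the up-regulation ranking puts every TP below every TN.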
DE <=1.2 lower than Empty • TP are identified as down-regulated because most TN are “Empty”, which have higher FC than DE <= 1.2
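A toy Python sketch of this effect, using invented log fold changes chosen only to mimic the pattern described (Equal near zero, Empty shifted upwards, low-FC DE in between). These are not Golden Spike values, and `roc_auc` and `up_auc` are hypothetical helper names.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented log fold changes (illustration only).
equal = [-0.05, 0.00, 0.05]                    # TN with FC close to 1
empty = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55]   # TN shifted upwards; the majority
low_de = [0.15, 0.20, 0.25]                    # TP with small true up-regulation


def up_auc(tp, tn):
    """AUC of a 1-sided test of up-regulation that ranks by log FC."""
    scores = tp + tn
    labels = [True] * len(tp) + [False] * len(tn)
    return roc_auc(scores, labels)


auc_with_empty = up_auc(low_de, equal + empty)  # Empty and Equal as TN
auc_equal_only = up_auc(low_de, equal)          # Equal only as TN
print(f"{auc_with_empty:.2f} vs {auc_equal_only:.2f}")
```

In this toy setting the up-regulation AUC falls below 0.5 once the Empty probesets are counted as TN, which is the same as saying the TP look down-regulated relative to the bulk of the TN.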
Remove “empty” probesets • We can remedy this by using just Equal probesets as our TN… • …bearing in mind that this makes the data somewhat atypical
Up-regulation - Empty in TN • Probesets identified as up-regulated generally not TP when using Empty in TN
Up-regulation - TN Equal • Probesets identified as up-regulated more likely to be TP when using only Equal as TN
Down-regulation - Empty in TN • DE probesets are mostly identified as down-regulated, even though in truth they are up-regulated • We appear to be identifying TP as down-regulated when including Empty in TN
Down-regulation - TN Equal • We generally don’t identify TP as down-regulated when excluding Empty from TN
“Recommended” test • We recommend using just Equal as TN, and all DE as TP
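A minimal Python sketch of how this labelling could be applied, assuming each probeset carries one of the category strings "DE", "Equal" or "Empty". The function name `recommended_labels` and the example categories are hypothetical.

```python
def recommended_labels(categories):
    """Return (kept_indices, is_positive) with Equal as TN and all DE as TP.

    Empty probesets are dropped rather than counted as true negatives.
    """
    kept, is_positive = [], []
    for index, category in enumerate(categories):
        if category == "Empty":
            continue  # excluded from the benchmark entirely
        if category not in ("DE", "Equal"):
            raise ValueError(f"unexpected category: {category!r}")
        kept.append(index)
        is_positive.append(category == "DE")
    return kept, is_positive


# Hypothetical probeset categories, in array order.
categories = ["DE", "Empty", "Equal", "DE", "Empty", "Equal"]
kept, is_positive = recommended_labels(categories)
print(kept, is_positive)
```

The returned indices can then be used to subset the scores from any summarisation/DE combination before drawing the ROC chart.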
Recommended Up-reg • Using our recommendations, tests of up-regulation generally find TP, as expected
Recommended Down-reg • Using our recommendations, tests of down-regulation generally don’t find TP, as expected
Analysis decisions to make • Summarisation method • DE method • Direction of DE (recommend up) • Choice of true negatives (equal only) • Choice of true positives (all DE) • Post-summarisation normalisation (loess using equal only) • Type of ROC chart (standard ROC) • Proportion of x-axis to display (all)
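One way to keep these decisions explicit is to record them in a single settings object, sketched here in Python. The class `BenchmarkSettings`, its field names and the placeholder method names are all hypothetical; the defaults simply restate the recommendations in brackets above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkSettings:
    summarisation: str                     # summarisation method under test
    de_method: str                         # DE method under test
    direction: str = "up"                  # direction of DE to test
    true_negatives: tuple = ("Equal",)     # Equal only, Empty excluded
    true_positives: str = "all DE"
    normalisation: str = "loess fitted to Equal probesets"
    roc_type: str = "standard"
    x_axis_fraction: float = 1.0           # show the whole x-axis


# Placeholder names, not real methods.
summarisation_methods = ["summarisation_A", "summarisation_B"]
de_methods = ["de_method_X", "de_method_Y", "de_method_Z"]

# One settings object per summarisation/DE combination to be compared.
runs = [BenchmarkSettings(s, d)
        for s, d in product(summarisation_methods, de_methods)]
print(len(runs), runs[0])
```

Making the object immutable (`frozen=True`) is a design choice so that every chart can be traced back to one unambiguous set of analysis decisions.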
Conclusions • First step towards a reliable benchmark for DE • Golden Spike data has some value if use of empty probesets is revisited • Certain combinations of summarisation/DE methods seem poor • Keep it open (Bioconductor) - because science should be reproducible!
Recommendations • Create a new spike-in data set where • Spike-in concentrations are realistic • DE spike-ins both up- and down-regulated • Concentrations and FC not confounded • Larger number of arrays • Benchmarks using regulatory information • Benchmarks for Illumina data • Benchmarks for SNP chips (GWA studies) • manchester.ac.uk/bioinformatics/affydecomp