Differential expression analysis for sequence count data

Differential expression analysis for sequence count data • Wolfgang Huber • Simon Anders

Context • Research group on statistical methods for genome biology • Joint appointment between EMBL HD and EBI • non-coding RNA, pervasive transcription • genetics of complex traits • HT microscopy for systems analysis • large-scale combinatorial RNAi and morphology phenotypes • ‘metrology’ for several genomic and proteomic technologies • Bioconductor

Count data in HTS • RNA-Seq • Tag-Seq • ChIP-Seq • Bar-Seq

Count table (rows: genes, columns: samples)

Gene       GliNS1   G144   G166   G179   CB541   CB660
13CDNA73        4      0      6      1       0       5
A2BP1          19     18     20      7       1       8
A2M          2724   2209     13     49     193     548
A4GALT          0      0     48      0       0       0
AAAS           57     29    224     49     202      92
AACS         1904   1294   5073   5365    3737    3511
AADACL1         3     13    239    683     158      40
[...]

Effect size vs significance

Statistical testing • Formulate a null hypothesis (e.g. ‘expression levels in these two conditions are the same’) • Define a value computed from the data (‘test statistic’) • Use your understanding of the null hypothesis, and the rules of probability calculus, to derive the distribution of the test statistic under the null hypothesis (its ‘null distribution’) • Compare the observed value with this distribution: if it is too extreme to have plausibly happened by chance, reject the null hypothesis.
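The recipe above can be sketched on a deliberately simplified example. The null hypothesis assumed here (each read of a gene is equally likely to come from condition A or B, with equal sequencing depth and no biological variability) and the function name are illustrative assumptions, not the test used later in these slides.

```python
from math import comb

def binomial_two_sided_pvalue(k_a, k_b):
    """Exact two-sided p value for counts k_a and k_b of one gene.

    Assumed null hypothesis (hypothetical simplification): every read is
    equally likely to come from condition A or B.
    """
    n = k_a + k_b
    # Test statistic: how far k_a deviates from its expected value n / 2.
    observed_deviation = abs(k_a - n / 2)
    # Null distribution of k_a given n: Binomial(n, 1/2).
    null_pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    # Add up the probability of all outcomes at least as extreme as observed.
    return sum(p for k, p in enumerate(null_pmf)
               if abs(k - n / 2) >= observed_deviation)
```

For example, `binomial_two_sided_pvalue(15, 5)` sums the null probabilities of 0 to 5 and 15 to 20 reads from condition A out of 20.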

Challenges with count data • discrete, non-negative, skewed (so a normal approximation is not appropriate) • small numbers of replicates (so distribution-free methods, e.g. rank-based or permutation tests, cannot be used) • sequencing depth (coverage) varies
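A short calculation illustrates why permutation tests lack power with few replicates. The sketch assumes a two-sided test whose statistic is unchanged when the two group labels are swapped, so that at least two labelings are as extreme as the observed one; the function name is hypothetical.

```python
from math import comb

def min_permutation_pvalue(replicates_per_group):
    """Smallest p value a two-group permutation test can attain.

    Assumes equal group sizes and a symmetric two-sided statistic, so the
    observed labeling and its mirror image are always equally extreme.
    """
    # Number of distinct ways to split 2m samples into two groups of m.
    n_labelings = comb(2 * replicates_per_group, replicates_per_group)
    return 2 / n_labelings
```

With 2 replicates per condition there are only 6 labelings, so no p value below 1/3 is attainable; with 3 replicates the floor is 0.1.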

Strategies that have served us well with microarray data • Use a distribution approximation in order to infer the ‘tail behaviour’ (probability of extreme values) from mean and variance. • Share data across genes in order to improve the estimation of the variance: similar genes should have similar variance. • limma / eBayes • SAGE: edgeR by Robinson and Smyth

Variance and mean are correlated • Tag-Seq counts of two replicate glioblastoma-derived tissue cultures (P. Bertone / EBI) • Figure legend (variance v as a function of mean x): local regression v = f(x) + x • linear v = ax² + x • Poisson v = x
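To get a feel for how the variance models in the legend diverge, the following sketch evaluates the Poisson model v = x and the model v = ax² + x at a low and a high mean. The value a = 0.1 is a hypothetical choice for illustration only, not a value fitted to the data shown.

```python
def poisson_variance(x):
    """Poisson model: variance equals the mean."""
    return x

def quadratic_variance(x, a):
    """Model v = a*x^2 + x: Poisson term plus a term growing with x squared."""
    return a * x * x + x

def variance_ratio(x, a):
    """How many times larger the quadratic-model variance is than Poisson."""
    return quadratic_variance(x, a) / poisson_variance(x)
```

Under this assumed a, the two models differ by a factor of 2 at x = 10 but by a factor of about 100 at x = 1000, so the gap matters most for highly expressed genes.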

Technical and biological replicates • RNA-Seq of yeast (Nagalakshmi et al. 2008) • Figure panels: biological replicates; technical replicates

Poisson • Is a natural ‘first try’ for count data. It models the minimal amount of variability that just comes from random sampling - even if all other variables are exactly fixed. • It fits well for technical replicates [1] - but hopelessly underestimates variance for biological replicates [2]. • [1] Marioni et al. (2008) • [2] Robinson and Smyth (2007), Nagalakshmi et al. (2008)
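The ‘minimal variability’ of the Poisson model can be checked numerically: its variance equals its mean, so the relative noise (coefficient of variation) is 1/√mean. The sketch below computes both moments directly from the probability mass function; the function names and the truncation point of the sum are choices made for this illustration.

```python
from math import exp, lgamma, log, sqrt

def poisson_pmf(k, rate):
    """Poisson probability of observing count k, computed on the log scale."""
    return exp(k * log(rate) - rate - lgamma(k + 1))

def poisson_mean_and_variance(rate):
    """Mean and variance obtained by summing over the (truncated) pmf."""
    # Truncate far into the upper tail, where the remaining mass is negligible.
    k_max = int(rate + 20 * sqrt(rate) + 50)
    pmf = [poisson_pmf(k, rate) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, variance

def poisson_cv(rate):
    """Coefficient of variation (standard deviation divided by mean)."""
    mean, variance = poisson_mean_and_variance(rate)
    return sqrt(variance) / mean
```

A gene with an expected count of 100 therefore has about 10% sampling noise under this model, and any additional variability between biological replicates comes on top of that.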

The negative-binomial distribution • Equation annotation: overdispersion parameter

NB distribution can be motivated by a hierarchical model • Biological sample-to-sample variability: Γ • Poisson counting statistics: P • Overall distribution: NB • NB(μ, σ² + μ) = Γ(μ, σ²) ∗ P(μ)
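The hierarchical model can be explored with a small simulation: draw a Poisson rate from a gamma distribution with mean μ and variance σ², then draw a count from a Poisson distribution with that rate. The resulting counts should have mean close to μ and variance close to σ² + μ. This is a sketch using only the Python standard library; the Poisson sampler (Knuth's multiplication method, suitable for moderate rates only), the function names and the parameter values are choices made for this illustration.

```python
import random
from math import exp
from statistics import mean, variance

def sample_poisson(rate, rng):
    """Draw one Poisson count (Knuth's method; intended for moderate rates)."""
    threshold = exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_gamma_poisson(mu, sigma2, rng):
    """Draw one count from the gamma-Poisson (negative binomial) hierarchy."""
    # Gamma with mean mu and variance sigma2: shape * scale = mu,
    # shape * scale^2 = sigma2.
    shape = mu * mu / sigma2
    scale = sigma2 / mu
    rate = rng.gammavariate(shape, scale)  # biological variability
    return sample_poisson(rate, rng)       # counting (shot) noise

def simulate(mu, sigma2, n, seed=0):
    """Return the sample mean and sample variance of n simulated counts."""
    rng = random.Random(seed)
    counts = [sample_gamma_poisson(mu, sigma2, rng) for _ in range(n)]
    return mean(counts), variance(counts)
```

Calling `simulate(20.0, 100.0, 20000)`, for instance, should give a mean near 20 and a variance near 120, well above the Poisson value of 20.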

Model fitting • To get an unbiased estimate of the raw variance σi², subtract an estimator of the Poisson “shot-noise” contribution from the sample variance of the counts.
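A minimal sketch of this subtraction for a single gene follows. It assumes that all replicates have the same sequencing depth, so that the shot-noise contribution can be estimated by the sample mean; with unequal depths the counts would first need to be normalized. The function name is hypothetical.

```python
from statistics import mean, variance

def raw_variance_estimate(counts):
    """Estimate the raw (biological) variance of one gene across replicates.

    Assumes equal sequencing depth in all replicates (a simplification).
    """
    total_variance = variance(counts)  # unbiased sample variance
    shot_noise = mean(counts)          # Poisson variance equals the mean
    # The difference can be negative for an individual gene, because the
    # sample variance is very noisy when there are few replicates.
    return total_variance - shot_noise
```

The noisiness of this per-gene estimate is what motivates estimating the variance jointly across genes rather than gene by gene.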

Testing for differential expression • We use a test similar to the one used in edgeR • For each of two conditions A and B, add the counts from all replicates, and consider them NB-distributed with moments as fitted. • Calculate the probability of observing the difference KiA − KiB (or a more extreme one), conditioned on the sum KiA + KiB; this is the p value.
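The conditional test can be sketched as follows. Two assumptions are made here and are not taken from the slide: ‘more extreme’ is interpreted as ‘all splits of the total count whose probability under the null is no larger than that of the observed split’, and the means and variances passed in are those of the summed counts per condition (each variance must exceed its mean). Function names are hypothetical.

```python
from math import exp, lgamma, log

def nb_logpmf(k, mean, var):
    """Log probability of count k under an NB with given mean and variance."""
    # Convert (mean, variance) to the usual (r, p) parameters; needs var > mean.
    p = mean / var
    r = mean * mean / (var - mean)
    return (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
            + r * log(p) + k * log(1 - p))

def nb_conditional_pvalue(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """p value for the split (k_a, k_b), conditioned on the total k_a + k_b."""
    total = k_a + k_b
    # Joint null probability of every split (a, total - a) of the total count.
    probs = [exp(nb_logpmf(a, mean_a, var_a)
                 + nb_logpmf(total - a, mean_b, var_b))
             for a in range(total + 1)]
    observed = probs[k_a]
    # Splits at most as probable as the observed one (with a small tolerance
    # for floating-point ties).
    extreme = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return extreme / sum(probs)
```

With identical null parameters for both conditions, a balanced split such as (50, 50) gives a p value of 1, while a lopsided split such as (100, 10) gives a very small one.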

Differential expression • RNA-Seq data: tumor vs control

Type I error control • Comparison of one GNS replicate with another one.

Selection across the dynamic range • Figure legend: all transcripts; hits from DESeq; hits from edgeR

Working without replicates • Comparing 1-vs-1 with 2-vs-2 • Counts shown in the figure: 620, 202, 271, 15,529

Variance-stabilizing transformation • The estimated variance-mean dependence allows a transformation that renders the count data approximately homoskedastic. This is useful e.g. for computing sample-sample distances.
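Given a variance-mean dependence v(μ), a variance-stabilizing transformation can be obtained by integrating 1/√v(μ). The sketch below does this numerically for an arbitrary variance function and compares the result with the closed form for the parametric model v(μ) = μ + aμ², which is (2/√a)·asinh(√(aμ)). The parametric model, the value of a, the lower integration limit and the function names are assumptions made for illustration; a local-regression fit would be plugged in as `variance_fn` instead.

```python
from math import asinh, sqrt

def quadratic_variance(mu, a):
    """Assumed parametric variance-mean dependence v = mu + a * mu^2."""
    return mu + a * mu * mu

def vst_closed_form(x, a):
    """Closed-form transformation for the quadratic variance model."""
    return 2.0 / sqrt(a) * asinh(sqrt(a * x))

def vst_numeric(x, variance_fn, lower=1.0, steps=100_000):
    """Integrate 1 / sqrt(v(mu)) from `lower` to x with the midpoint rule."""
    h = (x - lower) / steps
    total = 0.0
    for i in range(steps):
        midpoint = lower + (i + 0.5) * h
        total += 1.0 / sqrt(variance_fn(midpoint))
    return total * h
```

For example, `vst_numeric(1000.0, lambda m: quadratic_variance(m, 0.1))` should agree closely with `vst_closed_form(1000.0, 0.1) - vst_closed_form(1.0, 0.1)`. On the transformed scale the variance is approximately constant, which is what makes sample-sample distances meaningful.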

Conclusions • Parametric model provides power for detecting differentially expressed genes even if there are few replicates (while controlling type I error) • Poisson model describes the minimal amount of error - between biological replicates, it will be larger. • Key assumption: negative binomial distribution. Mean estimated directly for each gene. Variance (or over-dispersion) estimated jointly for all genes, in the form of a local regression relationship. • Software: R package DESeq • Extensions: transcript length (RNA-Seq); other covariates (‘GC’); more complex contrasts (ANOVA); regression on continuous variables

Acknowledgements • Simon Anders • Bernd Fischer • Greg Pau • Elin Axelsson • Daniel Murrell • Julien Gagneur • Nicolas delHomme • Stefan Wilkening • Emilie Fritsch • Lars Steinmetz • Paul Bertone • Jan Korbel • All contributors to the R and Bioconductor projects
