Introduction to DESeq and edgeR packages
Peter A.C. ’t Hoen
Poisson distribution
• Discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time, if these events occur with a known average rate and independently of the time since the last event
• P(k; λ) = λ^k e^(−λ) / k!
• λ = expected number of occurrences; k = number of occurrences
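As a small illustration of the definition above, the sketch below evaluates the Poisson probabilities for an arbitrarily chosen rate (λ = 3 is an assumption made for the example, not a value from the slides) and checks numerically that the mean and the variance both come out equal to λ.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson distribution with expected count lam (lam > 0)."""
    # Work on the log scale so that large k does not overflow the factorial.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 3.0  # illustrative rate, chosen arbitrarily
# Truncate at k = 59; the remaining tail is negligible for lam = 3.
probs = [poisson_pmf(k, lam) for k in range(60)]
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(sum(probs), mean, variance)
```

The equality of mean and variance is the property that the later slides relax by moving to the negative binomial distribution.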
Count process
• Poisson distribution: Y_t ~ Poisson(λ_t) with λ_t = p n_t
  t: tag; λ: true expression; Y: observed expression; p: probability; n: total number of RNA molecules
• Truncated Poisson distribution: zero can mean not expressed or not counted
• Count variance ~ λ_t
• Murray F Freeman and John W Tukey. Ann Math Statist, 21:607-611 (1950)
Negative binomial distribution
• Discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number r of failures occurs
• Also arises as a continuous mixture of Poisson distributions where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can view the negative binomial as a Poisson(λ) distribution, where λ is itself a random variable, distributed according to Gamma(r, p/(1 − p)).
edgeR (1)
• Robinson, Smyth (Biostatistics, 2008; Bioinformatics, 2007)
• Package available from Bioconductor with a very informative vignette
Y_ij ~ NB(μ_ij, φ)
Var(Y_ij) = μ_ij (1 + μ_ij φ)
• Negative binomial (gamma-Poisson) with mean μ
• φ is the overdispersion parameter (biological variation)
• φ = 0 gives the Poisson distribution
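The gamma-Poisson view of the negative binomial can be checked by simulation. The sketch below is not edgeR code; it draws a Poisson rate from a gamma distribution with mean μ and squared coefficient of variation φ, then draws a Poisson count with that rate, and compares the sample variance with μ(1 + μφ). The values μ = 20 and φ = 0.25, the seed, and the helper names are assumptions made for the example.

```python
import math
import random

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for small to moderate lam."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_nb(mu: float, phi: float, rng: random.Random) -> int:
    """Draw from NB(mean mu, dispersion phi) as a gamma-Poisson mixture."""
    # gammavariate(shape, scale): mean = mu, variance = phi * mu**2
    rate = rng.gammavariate(1.0 / phi, mu * phi)
    return sample_poisson(rate, rng)

rng = random.Random(42)   # fixed seed so the run is repeatable
mu, phi = 20.0, 0.25      # illustrative values, not taken from the slides
draws = [sample_nb(mu, phi, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((y - sample_mean) ** 2 for y in draws) / (len(draws) - 1)
expected_var = mu * (1 + mu * phi)  # 20 * (1 + 5) = 120
print(sample_mean, sample_var, expected_var)
```

With φ close to 0 the gamma distribution collapses onto μ and the counts become plain Poisson draws, which is the limiting case listed on the slide.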
edgeR (2)
• Test per gene
Y_gij ~ NB(μ_gij, φ_g), where μ_gij = M_j × p_gi
Var(Y_gij) = μ_gij (1 + μ_gij φ_g)
p_gi is the proportion of tags for tag g in sample i
M_j is the library size for sample i and library j
φ_g is the dispersion parameter for tag g
edgeR (3)
• Estimation of the common dispersion parameter by conditioning on the sum of counts for each tag g and maximizing the common likelihood l_C(φ) = Σ_g l_g(φ)
• Common dispersion parameter OR weighted linear combination of common and individual likelihoods: WL(φ_g) = l_g(φ_g) + α l_C(φ_g), where α is the weight given to the common likelihood
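The common-dispersion idea can be sketched in a few lines for the special case of equal library sizes, where conditioning on the per-tag sum of counts removes the mean from the likelihood. This is a simplified illustration, not the edgeR implementation: it omits the quantile adjustment for unequal library sizes, uses a plain grid search, and runs on simulated counts (500 tags, 4 libraries, true φ = 0.2, all of which are assumptions made for the example).

```python
import math
import random

def conditional_loglik(counts, phi):
    """l_g(phi): NB log-likelihood of one tag conditioned on its total count.

    Valid for equal library sizes; terms not depending on phi are dropped.
    """
    n, r, total = len(counts), 1.0 / phi, sum(counts)
    return (sum(math.lgamma(y + r) for y in counts)
            + math.lgamma(n * r) - math.lgamma(total + n * r)
            - n * math.lgamma(r))

def common_dispersion(count_table, grid):
    """Grid value of phi that maximizes l_C(phi) = sum over tags of l_g(phi)."""
    return max(grid, key=lambda phi: sum(conditional_loglik(c, phi)
                                         for c in count_table))

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate lam."""
    threshold, k, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

# Simulated data: each tag has its own mean, all tags share one dispersion.
rng = random.Random(1)
true_phi = 0.2
table = []
for _ in range(500):
    mu = rng.uniform(10, 50)
    table.append([sample_poisson(rng.gammavariate(1 / true_phi, mu * true_phi), rng)
                  for _ in range(4)])

grid = [i / 100 for i in range(1, 101)]  # candidate phi values 0.01 .. 1.00
estimate = common_dispersion(table, grid)
print(estimate)
```

A tagwise estimate along the lines of the weighted likelihood would maximize l_g(φ) + α l_C(φ) for each tag separately over the same grid.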
edgeR (4)
• Exact test replacing hypergeometric probabilities with NB-derived probabilities (qCML) for single-factor experiments
• Generalized linear models and the Cox-Reid profile-adjusted likelihood (CR) method for multifactorial experiments
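To make the exact test concrete, here is a minimal sketch for one tag in a two-group comparison. It assumes equal library sizes and a known dispersion φ, so that each group sum is again negative binomial and the split of the total count between the groups has a distribution that does not depend on the mean. The function names are hypothetical and the code is an illustration of the principle, not the edgeR routine.

```python
import math

def log_weight(group_sum, n_libs, phi):
    """Log of the phi-dependent NB factor for a group sum.

    The sum of n_libs NB(mu, phi) counts is NB with size n_libs / phi;
    factors shared by every split of the total are left out.
    """
    size = n_libs / phi
    return (math.lgamma(group_sum + size) - math.lgamma(size)
            - math.lgamma(group_sum + 1))

def nb_exact_test(sum_a, sum_b, n_a, n_b, phi):
    """Two-sided p-value: total probability of all splits of the overall
    count that are no more likely than the observed split."""
    total = sum_a + sum_b
    logw = [log_weight(a, n_a, phi) + log_weight(total - a, n_b, phi)
            for a in range(total + 1)]
    peak = max(logw)
    weights = [math.exp(lw - peak) for lw in logw]  # rescaled for stability
    observed = weights[sum_a]
    tail = sum(w for w in weights if w <= observed * (1 + 1e-9))
    return tail / sum(weights)

# Two libraries per group, phi = 0.1 (illustrative values).
p_balanced = nb_exact_test(50, 50, 2, 2, 0.1)   # counts split evenly
p_extreme = nb_exact_test(0, 100, 2, 2, 0.1)    # all counts in one group
print(p_balanced, p_extreme)
```

The evenly split case is the most probable outcome, so its p-value is 1, while the case with all counts in one group gives a very small p-value.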
edgeR: what is new?
• The exact test cannot handle confounders; it is replaced by a generalized linear model with a log-likelihood ratio test
• Abundance trending in dispersion estimates
Dispersion trend
[Figure; axes: dispersion, abundance]
Dispersion trending (after filtering for low abundance)
[Figure; axes: dispersion, abundance]
DESeq (1)
• Anders and Huber: Genome Biology (2010) 11:R106
• Roughly the same principles as edgeR
• No multifactorial analysis implemented yet
DESeq (2)
(1) Y_ij ~ NB(μ_ij, σ²_ij)
(2) μ_ij = s_j q_{i,ρ(j)}
    s_j: scaling factor for sample j
    q_{i,ρ(j)}: proportional concentration of tag i in condition ρ(j)
(3) σ²_ij = μ_ij + s_j² ν_{i,ρ(j)}
    μ_ij: count noise; s_j² ν_{i,ρ(j)}: extra variance
    ν_{i,ρ(j)} is a smooth function depending on q_{i,ρ(j)} (concentration)
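The two DESeq equations for the mean and the variance can be written out directly. In the sketch below the function names and the numbers are hypothetical. The raw-variance function ν(q) = φ q² is an assumption picked for the example because it makes the DESeq variance coincide with the edgeR form μ(1 + μφ); DESeq itself fits ν as a smooth function of q from the data.

```python
def deseq_moments(size_factor, concentration, raw_variance):
    """Mean and variance of a count under the DESeq model.

    mean     = s_j * q
    variance = mean (count noise) + s_j**2 * nu (extra variance)
    """
    mean = size_factor * concentration
    variance = mean + size_factor ** 2 * raw_variance
    return mean, variance

def constant_cv_raw_variance(concentration, phi):
    """Assumed raw-variance function nu(q) = phi * q**2 (illustration only)."""
    return phi * concentration ** 2

s_j, q, phi = 2.0, 50.0, 0.1  # illustrative values
nu = constant_cv_raw_variance(q, phi)
mean, variance = deseq_moments(s_j, q, nu)

# With this choice of nu the result matches the edgeR variance mu * (1 + mu * phi).
edger_variance = mean * (1 + mean * phi)
print(mean, variance, edger_variance)
```

Letting ν vary freely with q, rather than fixing it to φ q², is what allows the variance trend to depart from the edgeR curve.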
DESeq (3): variance trend with expression
[Figure legend] Purple: Poisson; dashed orange: edgeR (before trending); orange: DESeq
You can derive: squared CV is 1/μ + φ
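The squared coefficient of variation follows in one step from the negative binomial variance Var(Y) = μ(1 + μφ), dividing by the squared mean:

```latex
\begin{align}
\mathrm{Var}(Y) &= \mu\,(1 + \mu\phi) = \mu + \phi\,\mu^{2} \\
\mathrm{CV}^{2} &= \frac{\mathrm{Var}(Y)}{\mu^{2}} = \frac{1}{\mu} + \phi
\end{align}
```

The term 1/μ is the Poisson (count noise) contribution, which vanishes for highly expressed tags, while φ is the part of the variation that remains at any expression level.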
DESeq (3)
• Differences with edgeR:
  • Complete shrinkage to trended dispersion; limited tagwise dispersion estimates
  • Different variance estimates for different sample groups allowed
  • Deals better with samples with large differences in read depth?
DESeq (4): statistical testing
• In analogy to the initial edgeR implementation: an exact test on the NB probabilities in the two conditions
Conclusions
• edgeR and DESeq are comparable implementations of statistical tests using the NB distribution
• edgeR and DESeq produce largely similar results
• Implementation of generalized linear models in edgeR allows for testing with confounders
• Results are comparable to limma for medium- to highly expressed genes: modeling of stochastic effects is particularly important for lowly expressed genes