Statistics for RNA-seq Analysis
Moscow Genomic Data Analysis 2012
Mark Reimers, PhD
Basic Statistics
• Variation in read counts should follow a Poisson distribution, and almost does if the replicates are done by the same lab, in the same batch, on the same machine, from the same library
• The distribution of counts within one sample typically follows a power law
Distribution of Counts within One Sample Follows a Power Law
If we plot the number of genes with a given count against that count on a log-log plot, we see something close to a straight line (with a bump at 0).
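The straight-line pattern described above can be checked numerically by fitting a slope on the log-log scale. The following is a minimal sketch, assuming a plain list of per-gene read counts; the function names and the toy frequencies are made up for illustration and do not come from the slides. Zero counts are dropped from the fit because log(0) is undefined.

```python
import math
from collections import Counter

def count_frequency(counts):
    """Map each nonzero read count to the number of genes with that count."""
    return Counter(c for c in counts if c > 0)

def loglog_slope(freq):
    """Least-squares slope of log(number of genes) versus log(read count)."""
    xs = [math.log(c) for c in freq]
    ys = [math.log(n) for n in freq.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Toy data (made up): the number of genes falls roughly as 1 / count^2.
toy_freq = {1: 1000, 2: 250, 4: 62, 8: 16, 16: 4}
toy_counts = [c for c, n in toy_freq.items() for _ in range(n)]
toy_counts += [0] * 300  # excess zeros, excluded from the fit

slope = loglog_slope(count_frequency(toy_counts))
print(f"log-log slope: {slope:.2f}")
```

A slope that stays roughly constant across the range of counts is what a power law looks like on this scale; real data would normally be binned before fitting.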
Approaches to Significance
• Continuous (easy)
  • At current read depths (>50M), most genes of interest are well above the threshold for the continuity approximation
• Discrete (hard)
  • All data are counts, and many are quite low, well below the n > 5 considered acceptable for a continuous approximation
  • Cost-effective studies will use multiplexing, so counts will remain low
Issues with Continuous Approximation
• Data are NOT anywhere near Gaussian
• Discrete counts under five may be poorly approximated by continuous distributions
  • Select only those genes with a mean count of at least five
• Ad-hoc fix: Winsorize the data and do t-tests
• Typically there are excess zeroes, resulting in extreme values
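The ad-hoc fix named above (Winsorize, then t-test) can be sketched as follows. This is an illustration only: the slides do not say which trimming fraction or which t-test variant is meant, so the `fraction` parameter and the choice of Welch's unequal-variance statistic are assumptions, and the counts are made up. Only the t statistic is computed; turning it into a p-value needs the t distribution, which is left out here.

```python
import statistics

def winsorize(values, fraction=0.25):
    """Clamp the lowest and highest `fraction` of values to the nearest kept value."""
    ordered = sorted(values)
    k = int(fraction * len(ordered))  # number of values clamped at each end
    if k == 0:
        return list(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

def welch_t(a, b):
    """Welch's t statistic for a difference in means (unequal variances allowed)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Made-up counts for one gene in two conditions; 0 and 90 are the extreme values.
condition_a = [0, 3, 4, 5, 5, 6, 7, 90]
condition_b = [9, 10, 11, 12, 12, 13, 14, 15]

t_raw = welch_t(condition_a, condition_b)
t_wins = welch_t(winsorize(condition_a), winsorize(condition_b))
print(f"t before Winsorizing: {t_raw:.2f}, after: {t_wins:.2f}")
```

Clamping rather than discarding the extremes keeps the sample size unchanged, which is why Winsorizing is often preferred over trimming when there are few replicates.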
Models for Count Data
• Poisson model
  • Standard model for count data
• Negative Binomial model
  • Higher variance than Poisson
• Zero-inflated (mixture) model
  • Allows excess 0 counts beyond either of the above
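To make the mixture idea concrete, here is a minimal sketch of a zero-inflated Poisson probability function. The slides do not specify which base distribution is inflated, so the Poisson base and the mixing weight `pi_zero` are assumptions chosen for illustration; the same construction applies to a Negative Binomial base.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of count k, computed on the log scale for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def zip_pmf(k, mean, pi_zero):
    """Zero-inflated Poisson: with probability pi_zero the count is a structural 0,
    otherwise it is drawn from Poisson(mean)."""
    base = (1.0 - pi_zero) * poisson_pmf(k, mean)
    return pi_zero + base if k == 0 else base

# Compare the probability of a zero count with and without inflation.
print(poisson_pmf(0, 5.0), zip_pmf(0, 5.0, 0.3))
```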
Poisson Model
• Describes counts of independent events where each has a small probability of occurring, such as reads from one gene
[Figure: Poisson distributions with various means]
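As a rough illustration of the Poisson model, the sketch below evaluates the probability function for a few means and confirms numerically that the variance equals the mean, which is the property that over-dispersed data violate. The helper names and the chosen means are illustrative, not taken from the slides.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed on the log scale for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def mean_and_variance(mean, upper=400):
    """Mean and variance obtained by summing the pmf over 0..upper-1."""
    probs = [poisson_pmf(k, mean) for k in range(upper)]
    mu = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mu) ** 2 * p for k, p in enumerate(probs))
    return mu, var

for lam in (1.0, 4.0, 10.0):
    mu, var = mean_and_variance(lam)
    print(f"mean parameter {lam}: mean {mu:.3f}, variance {var:.3f}")
```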
Why is GLM for RNA-Seq hard?
• Data from biological replicates are greatly over-dispersed compared to the Poisson distribution
• Common ad-hoc fix: model with the Negative Binomial
• Typically there are excess zeroes beyond what NB or any standard discrete distribution predicts
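One simple way to see over-dispersion is the variance-to-mean ratio of a gene's counts across replicates: it is close to 1 under a Poisson model and much larger for over-dispersed data. The sketch below uses made-up counts and a hypothetical helper name; with only a handful of replicates the ratio is a noisy estimate, so it is a diagnostic rather than a test.

```python
import statistics

def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson counts)."""
    return statistics.variance(counts) / statistics.mean(counts)

# Made-up counts for one gene, both sets with mean 10.
tight_replicates = [8, 12, 10, 9, 11]
spread_replicates = [2, 25, 5, 18, 0]

print(dispersion_index(tight_replicates))
print(dispersion_index(spread_replicates))
```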
Negative Binomial Distribution
• Generalization of the geometric distribution
• Repeat Bernoulli(p) trials
• Count the number of non-selected outcomes until r selected outcomes occur
[Figure: Negative Binomial distributions for p = 0.2 and various r. Note: r = 0.5 is defined by analogy.]
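The construction in the bullets above can be simulated directly. This is a sketch under one stated assumption: the slide does not say which outcome has probability p, so the code uses an explicit, hypothetical parameter `p_selected` for the per-trial probability of a "selected" outcome. Under that reading the expected count is r(1 - p_selected)/p_selected, which matches the mean formula m = pr/(1-p) on the next slide when p there is the probability of the non-selected (counted) outcome. Simulation only covers integer r; non-integer values such as r = 0.5 need the gamma-function form of the probability function.

```python
import random

def draw_negative_binomial(r, p_selected, rng):
    """Run Bernoulli trials until r 'selected' outcomes occur and return
    the number of 'non-selected' outcomes seen along the way."""
    selected = 0
    non_selected = 0
    while selected < r:
        if rng.random() < p_selected:
            selected += 1
        else:
            non_selected += 1
    return non_selected

rng = random.Random(2012)  # fixed seed so the run is repeatable
draws = [draw_negative_binomial(3, 0.5, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(f"simulated mean: {sample_mean:.3f} (expected r(1-p)/p = 3.0)")
```

Setting r = 1 gives the geometric distribution, which is the sense in which the Negative Binomial generalizes it.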
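Alternate Parameterization by Mean and Over-Dispersion
The Negative Binomial may also be parameterized by its mean and variance (with p the per-trial probability of the outcome being counted):
$m = \dfrac{pr}{1-p}$, $s^2 = \dfrac{pr}{(1-p)^2}$
Over-dispersion parameter $q$: $s^2 = m + q m^2$
$q = 1/r$; $p = \dfrac{qm}{1+qm}$
If $q = 0$ (the limit $r \to \infty$), the distribution is Poisson
[Figure: Negative Binomial distributions for m = 10 and various q]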
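The conversion between (m, q) and (r, p) can be verified numerically. The sketch below follows the formulas on this slide, which hold when p is read as the per-trial probability of the counted outcome (an interpretation, since the slides do not state it). It builds the probability function from m and q and checks that the resulting mean and variance are m and m + qm². Function names are illustrative.

```python
import math

def to_r_p(m, q):
    """Convert mean m and over-dispersion q to (r, p): r = 1/q, p = qm/(1+qm)."""
    return 1.0 / q, q * m / (1.0 + q * m)

def nb_pmf(k, m, q):
    """Negative Binomial probability of count k given mean m and over-dispersion q.
    Uses log-gamma so that non-integer r is allowed."""
    r, p = to_r_p(m, q)
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(p) + r * math.log(1.0 - p))

def mean_and_variance(m, q, upper=2000):
    """Mean and variance obtained by summing the pmf over 0..upper-1."""
    probs = [nb_pmf(k, m, q) for k in range(upper)]
    mu = sum(k * pr for k, pr in enumerate(probs))
    var = sum((k - mu) ** 2 * pr for k, pr in enumerate(probs))
    return mu, var

m, q = 10.0, 0.5
mu, var = mean_and_variance(m, q)
print(f"mean {mu:.4f}, variance {var:.4f}, m + q*m^2 = {m + q * m * m:.4f}")
```

Note that some libraries use the opposite convention for p (the probability of the outcome that stops the count), so p may need to be replaced by 1 - p when comparing against them.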
Using the Negative Binomial Model to Test for Differential Expression
• Assume dispersion parameters are identical between samples
• Test for a difference of means using a Likelihood Ratio Test
  • $2 \log\left( \dfrac{P(x_1 \mid m_1, q)\, P(x_2 \mid m_2, q)}{P((x_1, x_2) \mid m, q)} \right) \sim \chi^2$ (asymptotically, with 1 degree of freedom)
• Can also use a t-test if a covariance matrix for the parameters is estimated
• Issue: what if library sizes differ?
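A minimal sketch of the likelihood ratio test follows, under several assumptions that go beyond the slide: the shared dispersion q is treated as already known (estimating it is a separate problem), library sizes are taken as equal (the issue raised in the last bullet is not addressed), and every group has a positive mean. With q fixed, the maximum-likelihood estimate of each mean is the sample mean, so no numerical optimization is needed. The statistic uses the standard scaling of twice the log-likelihood ratio, compared against a chi-squared distribution with 1 degree of freedom. Function names and counts are made up.

```python
import math

def nb_loglik(xs, m, q):
    """Log-likelihood of counts xs under a Negative Binomial with mean m and
    over-dispersion q (variance m + q*m^2). Assumes m > 0 and q > 0."""
    r = 1.0 / q
    p = q * m / (1.0 + q * m)
    return sum(
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + k * math.log(p) + r * math.log(1.0 - p)
        for k in xs
    )

def lrt_equal_means(x1, x2, q):
    """Likelihood ratio test of m1 == m2 with a shared, known dispersion q.
    Returns (statistic, p-value from chi-squared with 1 degree of freedom)."""
    m1 = sum(x1) / len(x1)  # MLE of the mean when q is fixed
    m2 = sum(x2) / len(x2)
    pooled = list(x1) + list(x2)
    m = sum(pooled) / len(pooled)
    stat = 2.0 * (nb_loglik(x1, m1, q) + nb_loglik(x2, m2, q)
                  - nb_loglik(pooled, m, q))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Made-up counts for one gene in two conditions.
stat, p_value = lrt_equal_means([10, 12, 9, 11], [30, 28, 35, 31], q=0.1)
print(f"LRT statistic {stat:.2f}, p-value {p_value:.2g}")
```

The chi-squared reference distribution is an asymptotic result, so with only a few replicates per group the p-value should be read as approximate.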