Statistics for RNA-seq Analysis

Moscow Genomic Data Analysis 2012
Mark Reimers, PhD

Basic Statistics
• Variation in read counts should follow a Poisson distribution
  • It almost does if replicates are done by the same lab, in the same batch, on the same machine, from the same library
• The distribution of counts within one sample typically follows a power law

Distribution of Counts within One Sample Follows a Power Law
If we plot the number of genes having a given count against that count on a log-log plot, we see something close to a straight line (with a bump at 0)
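A rough way to check the straight-line claim is to tabulate how many genes share each count and fit a slope on the log-log scale. The sketch below is only illustrative: the function names and the toy counts are hypothetical, and a least-squares slope is a crude diagnostic rather than a proper power-law fit.

```python
from collections import Counter
import math

def count_spectrum(gene_counts):
    """Return sorted (count, number_of_genes) pairs for one sample."""
    return sorted(Counter(gene_counts).items())

def loglog_slope(spectrum):
    """Least-squares slope of log(number_of_genes) against log(count).

    Genes with count 0 are skipped because log(0) is undefined.
    Needs at least two distinct positive counts.
    """
    pts = [(math.log(c), math.log(n)) for c, n in spectrum if c > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Hypothetical toy sample: 8 genes with count 1, 4 with count 2, and so on.
toy_counts = [0] * 5 + [1] * 8 + [2] * 4 + [4] * 2 + [8] * 1
spectrum = count_spectrum(toy_counts)
slope = loglog_slope(spectrum)
print(spectrum, slope)
```

A slope that stays roughly constant across the range of counts is what the straight line on the slide corresponds to.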

RNA-Seq Significance Testing

Approaches to Significance
• Continuous (easy)
  • At current read depths (>50M), most genes of interest are well above the threshold for a continuous approximation
• Discrete (hard)
  • All data are counts, and many are quite low, well below the acceptable n > 5 for a continuous approximation
  • Cost-effective studies will use multiplexing, so counts will remain low

Issues with Continuous Approximation
• Data are NOT anywhere near Gaussian
• Discrete counts under five may be poorly approximated by continuous distributions
  • Select only those genes with a mean count of at least five
• Ad-hoc fix: Winsorize data and do t-tests
• Typically there are excess zeroes, resulting in extreme values
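The Winsorize-then-t-test fix can be sketched in a few lines. This is one possible reading, not necessarily the procedure used in the talk: the 10% clamping fraction, the choice of Welch's unequal-variance statistic, and the example counts are all assumptions made for illustration.

```python
import math
import statistics

def winsorize(values, fraction=0.1):
    """Clamp the lowest and highest `fraction` of values to the nearest retained value."""
    xs = sorted(values)
    k = int(fraction * len(xs))          # number of values clamped at each end
    lo, hi = xs[k], xs[len(xs) - 1 - k]
    return [min(max(v, lo), hi) for v in values]

def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical counts for one gene in two conditions; group_a has a zero and an outlier.
group_a = [12, 15, 11, 14, 13, 0, 16, 12, 13, 95]
group_b = [22, 25, 21, 27, 24, 23, 26, 22, 20, 25]

wins_a = winsorize(group_a)
wins_b = winsorize(group_b)
print(welch_t(group_a, group_b), welch_t(wins_a, wins_b))
```

Clamping pulls the stray 0 and 95 back toward the bulk of the data, so the variance estimate is no longer dominated by two values. Turning the statistic into a p-value still requires the t distribution with Welch's degrees of freedom, which is omitted here.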

Models for Count Data
• Poisson model
  • Standard model for count data
• Negative Binomial model
  • Higher variance than Poisson
• Zero-inflated (mixture) model
  • Allows excess 0 counts beyond either of the above
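To make the mixture idea concrete, here is a minimal sketch of a zero-inflated Poisson probability function. The names `zip_pmf` and `pi0` are hypothetical, and a Poisson base is chosen only for brevity; the same construction applies with a Negative Binomial base.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of observing k events when the mean is mu (mu > 0)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def zip_pmf(k, mu, pi0):
    """Zero-inflated Poisson: with probability pi0 the count is a structural zero,
    otherwise it is drawn from Poisson(mu)."""
    base = (1.0 - pi0) * poisson_pmf(k, mu)
    return pi0 + base if k == 0 else base

# Illustrative values only: mean 3 with 20% structural zeroes.
print(poisson_pmf(0, 3.0), zip_pmf(0, 3.0, 0.2))
```

With `pi0 = 0` the mixture reduces to the plain Poisson model, so the extra parameter only matters when zeroes are in excess.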

Poisson Model
• Describes counts of independent events, each with a small probability of occurring, such as reads from one gene
[Figure: Poisson distributions with various means]

Variance Increases with Mean
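For the Poisson model the variance equals the mean, which can be confirmed numerically by summing over the probability function. The sketch below is illustrative; the truncation point `kmax` and the three example means are arbitrary choices.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of observing k events when the mean is mu (mu > 0)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_moments(mu, kmax=200):
    """Mean and variance computed by direct summation up to kmax."""
    ks = range(kmax + 1)
    mean = sum(k * poisson_pmf(k, mu) for k in ks)
    var = sum((k - mean) ** 2 * poisson_pmf(k, mu) for k in ks)
    return mean, var

for mu in (1.0, 5.0, 20.0):
    mean, var = poisson_moments(mu)
    print(f"mu={mu}: mean={mean:.4f}, variance={var:.4f}")
```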

Why is GLM for RNA-Seq hard?
• Data from biological replicates are greatly over-dispersed compared to the Poisson distribution
• Common ad-hoc fix: model by Negative Binomial
• Typically there are excess zeroes beyond NB or any standard discrete distribution
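A quick diagnostic for over-dispersion is the ratio of sample variance to sample mean, which should be near 1 for Poisson data. The sketch below also gives a simple method-of-moments estimate of the Negative Binomial over-dispersion, assuming the variance takes the form mean + dispersion × mean². The function names and replicate counts are hypothetical, and real tools estimate dispersion more carefully than this.

```python
import statistics

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson-like data."""
    return statistics.variance(counts) / statistics.mean(counts)

def moment_overdispersion(counts):
    """Method-of-moments over-dispersion, assuming variance = mean + theta * mean**2.

    Clamped at 0 so that under-dispersed samples fall back to the Poisson case.
    """
    m = statistics.mean(counts)
    v = statistics.variance(counts)
    return max(0.0, (v - m) / m ** 2)

# Hypothetical counts for one gene across five biological replicates.
replicates = [35, 60, 110, 48, 97]
index = dispersion_index(replicates)
theta_hat = moment_overdispersion(replicates)
print(index, theta_hat)
```

With only a handful of replicates per gene these estimates are very noisy, which is part of why the modelling problem is hard.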

Negative Binomial Distribution
• Generalization of the geometric distribution
• Repeat Bernoulli(p) trials
• Count the number of non-selected outcomes until r selected outcomes occur
[Figure: Negative Binomial distributions for p = 0.2 and various r. Note: r = 0.5 is defined by analogy]
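Writing the binomial coefficient with gamma functions is what lets r take non-integer values such as 0.5. The sketch below assumes that p is the per-trial probability of the counted outcome, which is the convention consistent with the mean pr/(1-p) used in the alternate parameterization; if p were instead the probability of the stopping outcome, the roles of p and 1-p would swap.

```python
import math

def nb_pmf(k, r, p):
    """Negative Binomial probability of k counted outcomes (each with probability p)
    before the r-th stopping outcome. r may be any positive real number."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(1.0 - p) + k * math.log(p))

# With r = 1 this is the geometric distribution: (1 - p) * p**k.
print(nb_pmf(3, 1.0, 0.2), 0.8 * 0.2 ** 3)

# Non-integer r is handled the same way.
print([round(nb_pmf(k, 0.5, 0.2), 4) for k in range(4)])
```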

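Alternate Parameterization by Mean and Over-Dispersion
Negative Binomial may also be parameterized by mean $\mu$ and variance $\sigma^2$:
$\mu = \frac{pr}{1-p}, \qquad \sigma^2 = \frac{pr}{(1-p)^2}$
Over-dispersion parameter $\theta$: $\sigma^2 = \mu + \theta\mu^2$, with $\theta = 1/r$ and $p = \frac{\theta\mu}{1+\theta\mu}$
If $\theta = 0$, the distribution is Poisson
[Figure: Negative Binomial distributions for $\mu = 10$ and various $\theta$]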
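The two parameterizations can be converted back and forth directly from the formulas on this slide. In the sketch below the names `mu` and `theta` stand for the mean and the over-dispersion parameter; the function names are hypothetical.

```python
def to_mean_dispersion(r, p):
    """Convert (r, p) to (mean, over-dispersion)."""
    mu = p * r / (1.0 - p)
    theta = 1.0 / r
    return mu, theta

def to_r_p(mu, theta):
    """Convert (mean, over-dispersion) to (r, p); requires theta > 0."""
    r = 1.0 / theta
    p = theta * mu / (1.0 + theta * mu)
    return r, p

def nb_variance(mu, theta):
    """Variance in the mean/over-dispersion form; theta = 0 gives the Poisson variance."""
    return mu + theta * mu ** 2

# Worked check with mean 10 and over-dispersion 0.5.
r, p = to_r_p(10.0, 0.5)
print(r, p)                                        # r = 2, p = 5/6
print(p * r / (1.0 - p) ** 2, nb_variance(10.0, 0.5))  # both forms of the variance
```

For mean 10 and over-dispersion 0.5 both variance expressions give 60, six times the Poisson variance at the same mean.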

Using the Negative Binomial Model to Test for Differential Expression
• Assume dispersion parameters are identical between samples
• Test for a difference of means using the Likelihood Ratio Test ($\mu$: mean, $\theta$: shared over-dispersion)
  • $2 \log \frac{P(x_1 \mid \mu_1, \theta)\, P(x_2 \mid \mu_2, \theta)}{P((x_1, x_2) \mid \mu, \theta)} \sim \chi^2$ (asymptotically, with 1 degree of freedom)
• Can also use a t-test if the covariance matrix of the parameter estimates is estimated
• Issue: what if library sizes differ?
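The likelihood ratio test above can be sketched as follows, under simplifying assumptions that are not from the slides: the over-dispersion is treated as known, and library sizes are taken to be equal, so the maximum-likelihood mean of each group is just its sample mean. The function names and example counts are hypothetical.

```python
import math

def nb_logpmf(k, mu, theta):
    """Log probability of count k under a Negative Binomial with mean mu
    and over-dispersion theta (theta > 0)."""
    if mu == 0.0:
        return 0.0 if k == 0 else float("-inf")
    r = 1.0 / theta
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def nb_lrt(x1, x2, theta):
    """Likelihood ratio test for equal means in two groups of counts.

    Returns (statistic, p_value), using the chi-square distribution with 1 df.
    """
    x1, x2 = list(x1), list(x2)
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    m0 = (sum(x1) + sum(x2)) / (len(x1) + len(x2))   # pooled mean under the null
    ll_alt = (sum(nb_logpmf(k, m1, theta) for k in x1)
              + sum(nb_logpmf(k, m2, theta) for k in x2))
    ll_null = sum(nb_logpmf(k, m0, theta) for k in x1 + x2)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))        # guard against rounding below 0
    p_value = math.erfc(math.sqrt(stat / 2.0))       # chi-square(1) upper tail
    return stat, p_value

# Hypothetical counts for one gene, three replicates per condition.
stat, p_value = nb_lrt([5, 7, 6], [50, 62, 55], theta=0.1)
print(stat, p_value)
```

When library sizes differ, the group means are no longer directly comparable; a common remedy is to model each count's mean as library size times a per-group rate, which this sketch does not do.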
