Assessing changes in data – Part 1, Statistics

This tutorial explores differential gene expression analysis between conditions. Learn about null hypothesis testing, count variance estimation, and distribution models like Poisson and Negative Binomial. Discover the importance of replicates and addressing overdispersion in sequencing data analysis.

Presentation Transcript


  1. Assessing changes in data – Part 1, Statistics

  2. Differential Expression Analysis (basics): The goal is to determine the set of genes whose read count differences between two conditions are greater than expected by chance. [Figure: a gene under Condition A and under Condition B.] Null hypothesis: for a given gene, the mean read count of the samples for condition A is equal to the mean read count of the samples for condition B. The null hypothesis is tested separately for each gene. This amounts to comparing the observed difference for each gene against the variance expected for its counts. Thus, good estimates of the count variance for each gene are essential to determining whether any change is significant.

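  3. Notation for count data: [Table: read counts for Gene 1 and Gene 2 in Sample A and Sample B.] Read counts are presumed to follow some particular statistical distribution $D$, which is defined by a collection of parameters $\theta$: $K_{ij} \sim D(\theta_{ij})$ for gene $i$ in sample $j$. Ultimately, we want to know whether the difference we see in counts on a gene is statistically significant. This essentially amounts to asking how the following statistic compares to the standard normal (Wald test): $z_i = (\hat{\theta}_{iA} - \hat{\theta}_{iB}) / \mathrm{SE}(\hat{\theta}_{iA} - \hat{\theta}_{iB})$, where $\hat{\theta}_{iA}$ and $\hat{\theta}_{iB}$ are the maximum likelihood estimates for samples A and B. So what type of distribution $D$ is appropriate for count data?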
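
As a rough illustration of the Wald-style comparison described on this slide, the sketch below divides a difference between two estimates by its standard error and looks up a two-sided p-value under the standard normal. The function names and the numbers are hypothetical and are not taken from the presentation.

```python
from math import erfc, sqrt


def wald_statistic(estimate_a: float, estimate_b: float, std_error: float) -> float:
    """Difference between two estimates in units of its standard error."""
    return (estimate_a - estimate_b) / std_error


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z."""
    return erfc(abs(z) / sqrt(2.0))


# Hypothetical estimates for one gene (illustrative numbers only).
z = wald_statistic(estimate_a=4.0, estimate_b=2.5, std_error=0.5)
p = two_sided_p_value(z)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The larger the standard error, the smaller the statistic, which is why the variance estimate for each gene matters so much for the final call.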

  4. What distribution describes read count data? The binomial distribution describes the probability of a number of successes ($k$) in a fixed number of independent binary events ($n$), each with a constant probability of success ($p$). For example: the probability of a given number of heads in a series of coin flips, each with a probability of landing heads up of $p = 0.5$ (for a fair coin). For our purposes, with sequencing data: • We can think of an individual read as a single Bernoulli trial (a flip of the coin), which either originated from a given gene (success) or not (failure). • Thus each read counted on a gene represents a single successful “event.”
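
The coin-flip example can be made concrete with a few lines of Python. This is a minimal sketch using the standard binomial probability mass function; the helper name `binomial_pmf` and the choice of 10 flips are illustrative assumptions.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 5 heads in 10 flips of a fair coin.
prob_five_heads = binomial_pmf(5, 10, 0.5)
print(f"P(5 heads in 10 flips) = {prob_five_heads:.4f}")

# The probabilities over all possible outcomes sum to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(f"sum over k = {total:.4f}")
```

In the sequencing analogy, `n` would be the number of reads, `k` the count on one gene, and `p` the chance that a read comes from that gene.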

  5. What distribution describes read count data? BUT… • For sequencing data, the number of “events” is not well defined. Furthermore, the number of reads sampled in sequencing is very large relative to the average number of successful events for a gene (i.e., the probability term, $p$, is extremely small). • In this situation, the Poisson distribution, which is the limit of the binomial for large $n$ and small $p$, is a better model for the distribution of read counts in each feature/gene.
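
To see how close the two distributions are in this regime, the sketch below compares binomial and Poisson probabilities for a large number of reads and a tiny per-read probability. The library size and probability are hypothetical values chosen only for illustration; both probability functions are evaluated in log space to avoid overflow.

```python
from math import exp, lgamma, log, log1p


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials, via log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log1p(-p))
    return exp(log_pmf)


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k events with mean lam, via log space."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


n_reads = 1_000_000      # hypothetical library size
p_gene = 2e-5            # hypothetical chance that a read comes from the gene
lam = n_reads * p_gene   # expected count on the gene

max_gap = max(abs(binomial_pmf(k, n_reads, p_gene) - poisson_pmf(k, lam))
              for k in range(61))
print(f"expected count = {lam:.0f}, largest PMF difference = {max_gap:.2e}")
```

The Poisson model needs only the expected count, not the number of trials, which is convenient when the number of “events” is not well defined.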

  6. Poisson distribution: The Poisson distribution is a discrete, single-parameter ($\lambda$) probability distribution. Probability mass function: $P(K = k) = \lambda^k e^{-\lambda} / k!$. Here $\lambda$ represents the mean value ($\mu$) of the distribution. The Poisson distribution has the special property that its variance is equal to its mean: $\sigma^2 = \mu = \lambda$. For sequence data, $\lambda$ represents the average expression level (counts) of a gene for a given experimental condition. This should be constant. In other words: every replicate of a condition shares the same $\lambda$ for that gene. However, this isn’t the full story…
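
The mean-equals-variance property can be checked numerically by summing over the probability mass function. This is a small sketch; the mean of 20 counts is a hypothetical value, and the sum is truncated where the remaining tail is negligible.

```python
from math import exp, lgamma, log


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) = lam**k * exp(-lam) / k!, evaluated in log space."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


lam = 20.0            # hypothetical mean count for one gene
support = range(200)  # the tail beyond this is negligible for lam = 20

mean = sum(k * poisson_pmf(k, lam) for k in support)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in support)
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```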

  7. Replicates: Fundamentally, replicates are necessary in order to draw more general conclusions about one’s experimental conditions. Without replicates, one has no knowledge of the biological variation among samples within a treatment group. Without some estimate of variability within a group, there is no sound statistical basis for inferring the significance of the differences between groups. There are two types of replicates: technical and biological.

  8. Replicates • Technical replicates: multiple instances of sequencing generated from the same sample (flow cells, lanes, etc.) • Biological replicates: multiple isolations of cells under the same experimental condition (environmental factors, growth conditions, time, etc.) Theoretically, replicates should be highly correlated.

  9. Overdispersion: Generally, the variance seen in technical replicates is well estimated by the Poisson distribution, BUT biological replicates empirically demonstrate greater variance (“overdispersion”) than this distribution can account for. [Figure: panels “Sample A”, “Sample B”, and “Sample A vs. Sample B”; axes RPM (tech rep 1) vs. RPM (tech rep 2) and RPM (A) vs. RPM (B).] QUESTION: Where does this additional variance in biological replicates come from? A: The average expression level of a gene varies from one biological replicate to the next, which is counter to the assumption made for our Poisson model!
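
A small simulation can illustrate this explanation. In the sketch below, technical replicates are drawn from a Poisson distribution with a fixed rate, while biological replicates first draw their own rate and then a Poisson count. Using a gamma distribution for the varying rate is an assumption made here for illustration, and the mean, dispersion, and replicate number are hypothetical.

```python
import random
from math import exp
from statistics import mean, variance


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
n_replicates = 5000   # far more than a real experiment; keeps the estimates stable
mu = 50.0             # hypothetical mean count
dispersion = 0.2      # hypothetical spread of the expression level

# Technical replicates: the same underlying rate every time.
technical = [sample_poisson(mu, rng) for _ in range(n_replicates)]

# Biological replicates: the rate itself varies (gamma-distributed with mean mu).
shape, scale = 1.0 / dispersion, mu * dispersion
biological = [sample_poisson(rng.gammavariate(shape, scale), rng)
              for _ in range(n_replicates)]

for label, counts in (("technical", technical), ("biological", biological)):
    print(f"{label:>10}: mean = {mean(counts):.1f}, variance = {variance(counts):.1f}")
```

Both groups have roughly the same mean, but only the technical replicates have a variance close to that mean.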

  10. Overdispersion: So, we see that for the read count on gene $i$ in sample $j$, the total variance is equal to the Poisson variance (the average count for gene $i$ in sample $j$, $\mu_{ij}$) plus some additional amount ($\alpha_i \mu_{ij}^2$, where $\alpha_i$ is the dispersion of gene $i$): $\sigma_{ij}^2 = \mu_{ij} + \alpha_i \mu_{ij}^2$. QUESTION: What model can account for the added variance? A: The negative binomial distribution!

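  11. Negative Binomial Model: This is evident when the two models are compared to actual count data! Since the variance of the NB, $\sigma^2 = \mu + \alpha\mu^2$ with dispersion $\alpha$, is a quadratic function of the mean, it can grow more quickly than the linear variance of the Poisson, $\sigma^2 = \mu$. Thus: $\sigma^2_{\mathrm{NB}} > \sigma^2_{\mathrm{Poisson}}$ (only for $\alpha > 0$).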
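
The difference between linear and quadratic growth is easy to tabulate. The sketch below assumes the common mean-dispersion form of the negative binomial variance, mean plus dispersion times mean squared; the dispersion value of 0.1 is hypothetical.

```python
def poisson_variance(mu: float) -> float:
    """Poisson variance equals the mean."""
    return mu


def negative_binomial_variance(mu: float, dispersion: float) -> float:
    """NB variance: Poisson term plus a quadratic overdispersion term."""
    return mu + dispersion * mu ** 2


dispersion = 0.1  # hypothetical value
print(f"{'mean':>6} {'Poisson var':>12} {'NB var':>10}")
for mu in (1, 10, 100, 1000):
    print(f"{mu:>6} {poisson_variance(mu):>12.1f} "
          f"{negative_binomial_variance(mu, dispersion):>10.1f}")
```

At low counts the two variances are nearly identical; at high counts the quadratic term dominates, which is where overdispersion matters most.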

  12. Estimating dispersion • Unfortunately, most experiments are limited to a few biological and technical replicates per condition because of the time required for library preparation and the cost of sequencing. • A byproduct of having few replicates is that gene-wise estimates of dispersion are unreliable. Getting around this shortcoming… • Some tools assume that genes with similar expression levels have similar dispersion. • Thus, individual gene-wise dispersion estimates ($\alpha_i$) can be plotted versus mean counts ($\bar{\mu}_i$) and a parametric model fit to the scatter plot; DESeq2, for example, uses a trend of the form $\alpha_{tr}(\bar{\mu}) = a_1 / \bar{\mu} + \alpha_0$. • This serves as a common dispersion trend to be shared across genes.
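
The idea of a shared trend can be sketched with a simple fit. The example below assumes a trend of the form dispersion = a1 / mean + alpha0 and fits it by ordinary least squares on 1 / mean. Both the functional form and the fitting method are simplifying assumptions made for illustration (real tools typically use more robust, iterative procedures), and the gene-wise values are invented.

```python
def fit_dispersion_trend(mean_counts, dispersions):
    """Least-squares fit of dispersion = a1 / mean + alpha0."""
    xs = [1.0 / m for m in mean_counts]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(dispersions) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, dispersions))
    a1 = sxy / sxx
    alpha0 = y_bar - a1 * x_bar
    return a1, alpha0


# Hypothetical gene-wise estimates: mean count and noisy dispersion per gene.
mean_counts = [5, 10, 20, 50, 100, 500, 1000]
gene_wise = [0.92, 0.41, 0.27, 0.12, 0.10, 0.04, 0.06]

a1, alpha0 = fit_dispersion_trend(mean_counts, gene_wise)
trend = [a1 / m + alpha0 for m in mean_counts]

for m, raw, fitted in zip(mean_counts, gene_wise, trend):
    print(f"mean = {m:>5}: gene-wise = {raw:.2f}, trend = {fitted:.2f}")
```

The fitted trend replaces or stabilizes each noisy gene-wise value with one informed by all genes of similar expression level.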
