560 likes | 813 Views
Statistical analysis of expression data:. Normalization, differential expression and multiple testing Jelle Goeman. Outline. Normalization Expression variation Modeling the log Fold change Complex designs Shrinkage and empirical Bayes (limma) Multiple testing (False Discovery Rate).
E N D
Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman
Outline • Normalization • Expression variation • Modeling the log Fold change • Complex designs • Shrinkage and empirical Bayes (limma) • Multiple testing (False Discovery Rate)
Platforms • Microarrays • RNAseq • Common: • Need for normalization • Batch effects
Why normalization • Some experimental factors cannot be completely controlled • Amount of material • Amount of degradation • Print tip differences • Quality of hybridization • Effects are systematic • Cause variation between samples and between batches
What is normalization? • Normalization = An attempt to get rid of unwanted systematic variation by statistical means • Note 1: this will never completely succeed • Note 2: this may do more harm than good • Much better, but often impossible Better control of the experimental conditions
How do normalization methods work? • General approach • Assume: data from an ideal experiment would have characteristic A E.g. mean expression is equal for each sample Note: this is an assumption! • If the data do not have characteristic A, change the data such that the data now do have characteristic A E.g. Multiply each sample’s expression by a factor
Example: quantile normalization • Assume: “Most probes are not differentially expressed” “As many probes are up and downregulated” • Reasonable consequence: The distribution of the expression values is identical for each sample • Normalization: Make the distribution of expression values identical for each sample
Quantile normalization in practice • Choose a target distribution • Typically the average of the measured distributions • All samples will get this distribution after normalization • Quantile normalization: • Replace the ith largest expression value in each sample by the ith largest value in the target distribution • Consequence: • Distribution of expressions the same between samples • Expressions for specific genes may differ
Less radical forms of normalization • Make the means per sample the same • Make the medians the same • Make the variances the same • Loess curve smoothing • Same idea, but less change to the data
Overnormalizing • Normalizing can remove or reduce true biological differences • Example: global increase in expression • Normalization can create differences that are not there • Example: almost global increase in expression • Usually: normalization reduces unwanted variation
Batch effects • Differences between batches are even stronger than between samples in the same batch • Note: batch effects at several stages • Normalization is not sufficient to remove batch-effects • Methods available (comBat) but not perfect • Best: avoid batch effects if possible
Confounding by batch • Take care of batch-effects in experimental design • Problem: confounding of effect of interest by batch effects • Example: Golub data • Solution: balance or randomize
Differential expression • Two experimental conditions • Treated versus untreated • Two distinct phenotypes • Tumor versus normal tissue • Which genes can reliably be called differentially expressed? • Also: continuous phenotypes • Which gene expressions are correlated with phenotype?
Variation in gene expression • Technical variation • Variation due to measurement technique • Variability of measured expression from experiment to experiment on the same subject • Biological variation • Variation between subjects/samples • Variability of “true” expression between different subjects • Total variation • Sum of technical and biological variation
Reliable assessment • Two samples always have different expression • Maybe even a high fold change • Due to random biological and technical variation • Reliable assessment of differential expression: • Show: fold change found cannot be explained by random variation
Assessment of differential expression • Two interrelated aspects: • Fold change: • How large is the expression difference found? • P-value: • How sure are we that a true difference exists?
Modeling variation • How does gene expression depend on experimental conditions? • Can often be well modeled with linear models • Limma: • linear models for microarray analysis • Gordon Smyth, W. and E. Hall Institute, Australia
Multiplicative scale effects • Assumption: effects on gene expression work in a multiplicative way (“fold change”) • Example: treatment increases gene expression of gene MMP8 by a factor 2 • “2-fold increase” • Treatment decreases gene expression of gene MMP8 by a factor 2 • “2-fold decrease”
Multiplicative scale errors • Assumption: variation on gene expression works in a multiplicative way • A 2-fold increase by chance is just as likely as a 2-fold decrease by chance • When true expression is 4, measuring 8 is as likely as measuring 2
Working on the log scale • When effects are multiplicative, log-transform! • Usual in microarray analysis: log to base 2 • Remember: log(ab) = log(a)+log(b) • 2 fold increase = +1 to log expression • 2 fold decrease = -1 to log expression • Log scale makes multiplicative effects symmetric • ½ and 2 are not symmetric around 1 (= no change) • -1 and +1 are symmetric around 0 (= no change)
A simple linear model • Example: treated and untreated samples • Model separately for each gene • Log Expression of gene 1: E1 • E1 = a + b * Treatment + error • a: intercept = average untreated logexpression • b: slope = treatment effect
Modeling all genes simultaneously • E1 = a1 + b1 * Treatment + error • E2 = a2 + b2 * Treatment + error • … • E20,000 = a20,000 + b20,000 * Treatment + error • Same model, but • Separate intercept and slope for each gene • And separate sd sigma1, sigma2, … of error
Estimates and standard errors • Gene 1: Estimates for a1, b1 and sigma1 • Estimate of treatment effect of gene 1 • b1 is the estimated log fold change • standard error s.e.(b1) depends on sigma1 • Regular t-test for H0: b1=0: • T = b1/s.e.(b1) • Can be used to calculate p-values. • Just like regular regression, only 20,000 times
Back to original scale • Log scale regression coefficient b1 • Average log fold change • Back to a fold change: 2^b1 • b1= 1 becomes fold change 2 • b1 = -1 becomes fold change 1/2
Confounders • Other effects may influence gene expression • Example: batch effects • Example: sex or age of patients • In a linear model we can adjust for such confounders
Flexibility of the linear model • Earlier: E1 = a1 + b1 * Treatment + error • Generalize: • E1 = a1 + b1 * X + c1 * Y + d1 + Z + error • Add as many variables as you need.
Empirical Bayes • So far: each gene on its own • 20,000 unrelated models • Limma: exchange information between genes • “Borrowing strength” • By empirical Bayes arguments
Estimating variance • For each gene a variance is estimated • Small sample size: variance estimate is unreliable • Too small for some genes • Too large for others • Variance estimated too small: false positives • Variance estimated too large: low power
Large and small estimated variance • Gene with low variance estimate • Likely to have low true variance • But also: likely to have underestimated variance • Gene with high variance estimate • Likely to have high true variance • But also: likely to have overestimated variance • Limma’s idea: • Use information from other genes to assess whether variance is over/underestimated
Variance model • Limma has a gene variance model • All gene’s variances are drawn at random from an inverse gamma distribution • Based on this model: • Large variances are shrunk downwards • Small variances are shrunk upwards
Effect of variance shrinkage • Genes with large fold change and large variance • More power • More likely to be significant • Genes with small fold change and small variance • Less power • Less likely to be significant
Limma and sample size • Shrinkage of limma only effective for small sample size (< 10 samples/group) • Added information of other genes becomes negligeable if sample size gets large • Large samples: Doing limma is the same as doing regression per gene
Modelling count data • Distinguish three types of variation • Biological variation • Technical variation • Count variation • Count variation is important for low-expressed genes • Generally biological variation most important
Overdispersion • Modelling count data: two stages • Model how gene expression varies from sample to sample • Model how the observed count varies by repeated sequencing of the same sample • Stage 2 is specific for RNAseq
Two approaches • Approach 1: Model the count variation and the between-sample variation • edgeR • Deseq • Approach 2: Normalize the count data and model only the biological variation • Voom + limma • Approach 3: Model count variation only • Popular but very wrong!
20,000 p-values • Fitting 20,000 linear models • Some variance shrinkage • Result: • 20,000 fold changes • 20,000 p-values • Which ones are truly differentially expressed?
Multiple testing • Doing 20,000 tests: risk false positive 20,000 times • If 5% of null hypotheses is significant, expect 1,000 significant by pure chance • How to make sure you can really trust the results?
Bonferroni • Classical way of doing multiple testing • Call K the number of tests performed • Bonferroni: significant = p-value < 0.05/K • “Adjusted p-value” • Multiply all p-values by K, compare with 0.05
Advantages of Bonferroni • Familywise error control • =Probability of making any type I error < 0.05 • With 95% chance, list of differentially expressed genes has no errors • Very strict • Easy to do
Disadvantages of Bonferroni • Very strict • “No” false positives • Many false negatives • It is not a big problem to have a few false positives • Do validation experiments later
False discovery rate (Benjamini and Hochberg) • FDR = expected proportion of false discoveries among all discoveries • Control of FDR at 0.05 means in the long run experiments average about 5% type I errors among the reported genes • Percentage: longer lists of genes are allowed to have more errors
Benjamini and Hochberg by hand 1. Order the p-values small to large Example: 0.0031, 0.0034, 0.02, 0.10, 0.65 2. Multiply the k-th p-value by m/k, where m is the number of p-values, so 0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5 which becomes 0.0155, 0.0085, 0.033, 0.125, 0.65 3. If the p-values are no longer in increasing order, replace each p-value by the smallest p-value that is later in the list. In the example, we replace 0.0155 by 0.0085. The final Benjamini-Hochberg adjusted p-values become 0.0085, 0.0085, 0.033, 0.125, 0.65