CSCI 5461: Functional Genomics, Systems Biology and Bioinformatics (Spring 2012). Lecture 8: Gene expression analysis / hypothesis testing. Chad Myers, Department of Computer Science and Engineering, University of Minnesota, cmyers@cs.umn.edu
Outline for today • Multiple hypothesis testing problem and corrections • Overview of data normalization/processing for microarrays/RNA seq
Wilcoxon Rank-sum test • Uses only the ranks of the data • Good for non-normal data • Identifies genes whose ranks are skewed toward one class • Because only ranks matter, it is less sensitive to outliers • But it also loses power when the sample size is small
Wilcoxon Rank-sum test • Procedure: • Merge all the observations from the two classes and rank them in ascending order • Calculate the Wilcoxon statistic: the sum of the ranks of the observations from the smaller class • Find the p-value in the Wilcoxon rank-sum distribution table (order statistics; approximately normal for larger samples)
Wilcoxon Rank-sum test • Calculate the Wilcoxon statistic: the sum of the ranks of the observations from the smaller class W = 1 + 2 + 3 + 5 + 6 + 8 + 10 = 35
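The procedure above can be sketched in a few lines of pure Python (a minimal illustration, not the lecture's code; the sample vectors in the usage line are invented, and the p-value uses the normal approximation without a continuity correction):

```python
import math

def rank(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0            # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def ranksum_test(x, y):
    """Wilcoxon rank-sum: W = sum of the ranks of the smaller class,
    with a two-sided normal approximation for the p-value."""
    if len(x) > len(y):
        x, y = y, x                          # make x the smaller class
    ranks = rank(list(x) + list(y))
    w = sum(ranks[:len(x)])
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2.0            # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail of N(0,1)
    return w, p

# Hypothetical expression values for one gene in two classes:
print(ranksum_test([1, 2, 3], [4, 5, 6, 7]))   # W = 6.0, p ~ 0.034
```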
Parametric vs. non-parametric tests: advantages/disadvantages?
Non-parametric vs. parametric • Advantages: • No distributional assumptions about the data, so the tests always apply • If used properly, more conservative (no risk of violating assumptions) • Disadvantages: • Depending on the test, you can lose power by not leveraging properties of the data (e.g. rank-sum ignores the magnitude of observations and uses only their ordering) • Can be computationally intensive (permutations/resampling to estimate the null) • When the null distribution is estimated by resampling, the resolution of the significance estimates is limited by the number of permutations (e.g. 1,000 permutations cannot yield a p-value below 1/1,000)
Example: Cancer Classification • Golub data • 3051 genes • 27 acute lymphoblastic leukemia (ALL) • 11 acute myeloid leukemia (AML)
Multiple hypothesis testing problem Assume we choose to reject the null hypothesis when p-value < 0.05, and assume the data actually fit the null distribution. How many false positives do we expect if we test 10,000 hypotheses? On average, 0.05 × 10,000 = 500 tests will be rejected purely by chance.
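This expectation is easy to check numerically; a small sketch (my own illustration, not from the lecture) draws 10,000 null p-values, which are Uniform(0, 1) when the null holds, and counts how many fall below the threshold:

```python
import random

random.seed(0)
m, alpha = 10000, 0.05
# Under the null, p-values are Uniform(0, 1), so each test "rejects"
# with probability alpha even though no real effect exists.
false_positives = sum(1 for _ in range(m) if random.random() < alpha)
print(false_positives)   # close to alpha * m = 500
```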
Multiple Testing Problem • Simultaneously test G null hypotheses, one for each gene j • Because gene expression experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue • Would like some sense of how ‘surprising’ the observed results are
Multiple Testing Problem • Assume simultaneously monitoring expression levels of N genes • Assume a global significance level α • Can we find an adjusted per-gene significance level α’ to control the global probability of any mistake? • Family-wise error rate (FWER).
Control of the FWER • Šidák correction (assumes independence): α' = 1 - (1 - α)^(1/N) • Bonferroni correction: α' = α/N, an upper bound on the probability of at least one mistake (Boole's inequality) • FWER control often yields very stringent adjustments
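Both adjustments are one-liners; the sketch below (an illustration under the slide's assumptions, not lecture code) shows the per-test levels, where the Šidák version is valid only for independent tests:

```python
def sidak_alpha(alpha, n):
    """Per-test level so that, for n independent tests,
    P(at least one false positive) equals alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

def bonferroni_alpha(alpha, n):
    """Per-test level alpha/n; by Boole's inequality this bounds the
    FWER by alpha with no independence assumption."""
    return alpha / n

# For 10,000 tests at a global alpha of 0.05:
print(sidak_alpha(0.05, 10000))       # ~5.13e-06
print(bonferroni_alpha(0.05, 10000))  # 5e-06, slightly stricter
```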
Control of the FWER • Holm's step-wise (step-down) correction • Rank genes by their p-values in ascending order • Compare the p-value p_i of the i-th ranked gene with the threshold α/(N - i + 1) • Let i* be the first rank at which p_i exceeds its threshold; reject the null hypothesis for genes 1, …, i* - 1 • The threshold on the p-value grows as the rank increases: after each rejection, one fewer test remains, so the next Bonferroni correction is of size N - i + 1
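Holm's procedure can be sketched as follows (an illustrative implementation, not the lecture's code; ties are handled naively and comparisons use `<=`):

```python
def holm_reject(pvals, alpha):
    """Holm step-down: sort p-values ascending, compare the i-th
    smallest (1-based) to alpha / (N - i + 1); stop at the first
    failure and reject everything before it."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    rejected = [False] * n
    for rank_i, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (n - rank_i + 1):
            rejected[idx] = True
        else:
            break                      # all larger p-values also fail
    return rejected

print(holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05))
# → [True, False, False, True]
```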
What about dependence among genes? Permutation-based approaches • Permute the class labels but keep the dependence structure among the genes' expression values • Take into account the joint distribution of the test statistics • Usually less conservative than Bonferroni-adjusted p-values • Computationally intensive; the full set of permutations may be too large to enumerate, so sampling is often needed
Westfall and Young minP permutation-based approach • Step-down minP adjusted p-values (Westfall & Young, 1993) • Permute the class labels of the samples • For each permutation b and each gene i, calculate a p-value; the step-down adjustment is applied over the successive minima of these permutation p-values • The resulting corrected p-values maintain the dependence structure among the genes' expression values
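The flavour of the Westfall-Young approach can be illustrated with the closely related single-step maxT variant, which is simpler to write down: permute the labels jointly for all genes (preserving their dependence) and compare each observed |t| to the permutation distribution of the maximum |t|. This is my own hedged sketch with invented toy data, not the lecture's minP code:

```python
import math
import random

def tstat(a, b):
    """Two-sample t statistic with unpooled (Welch-style) variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def maxT_adjusted(data, labels, n_perm=1000, seed=0):
    """Single-step maxT: adjusted p_i = fraction of label permutations
    in which max_j |t_j| meets or exceeds the observed |t_i|.  The
    same permutation is applied to every gene, so the genes'
    correlation structure is preserved."""
    rng = random.Random(seed)
    split = lambda g, lab: ([x for x, l in zip(g, lab) if l == 0],
                            [x for x, l in zip(g, lab) if l == 1])
    obs = [abs(tstat(*split(g, labels))) for g in data]
    exceed = [0] * len(data)
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)                   # one shared permutation
        tmax = max(abs(tstat(*split(g, perm))) for g in data)
        for i, o in enumerate(obs):
            if tmax >= o:
                exceed[i] += 1
    return [e / n_perm for e in exceed]

# Invented toy data: gene 0 differs strongly between classes, gene 1 not.
data = [[1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2],
        [1.0, 2.0, 1.5, 1.2, 1.1, 1.9, 1.4, 1.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(maxT_adjusted(data, labels))
```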
Multiple hypothesis test correction alternatives Let m be the number of tests, V the number of type I errors (false positives), and R the total number of rejected hypotheses. • Per-comparison error rate: expected proportion of type I errors (no correction): PCER = E(V/m) = α • Family-wise error rate: probability of at least one type I error (Bonferroni, Holm's step-down, …): FWER = P(V > 0) • False discovery rate: expected proportion of false positives among the rejected hypotheses: FDR = E(V/R)
Control of the FDR: Benjamini and Hochberg (1995) “step-up” procedure • Rank genes by their p-values in ascending order • Find the maximum i such that the p-value p_i of the i-th ranked gene satisfies p_i ≤ (i/m)·α • Reject the null hypothesis for genes 1, …, i (assumes the genes are independent) Dependent case: Benjamini and Yekutieli (2001)
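A compact sketch of the step-up procedure (illustrative only, not lecture code; it takes the p-values as given and returns a reject/accept flag per gene):

```python
def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: find the largest i (1-based over the
    ascending-sorted p-values) with p_(i) <= (i/m)*alpha and reject
    hypotheses 1..i; controls the FDR at alpha for independent tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for i, idx in enumerate(order, start=1):
        if pvals[idx] <= i * alpha / m:
            k = i                      # keep the largest passing rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(bh_reject(pvals, alpha=0.05))    # rejects only the two smallest
```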
Newer alternatives for controlling the FDR pFDR approach, including versions with no independence assumption: Storey, J. D. & Tibshirani, R. (2001). Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical Report 2001-28, Department of Statistics, Stanford University. Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64: 479-498. Others: Efron, B. & Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23: 70-86. Tusher et al. (SAM). Nice review: Ge, Dudoit, and Speed. Resampling-based multiple testing for microarray data analysis. Technical Report #633, 2003.
Example: permutation approach to estimate the FDR (Tusher et al., SAM) • Choose cutoffs t1 and t2 on the statistic • Calculate the average number of genes that exceed these cutoffs in the permutations • This average estimates the number of falsely significant genes under H0 • Divide by the number of genes called significant in the real data to estimate the FDR
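The same idea in miniature (an illustrative sketch, not SAM itself: a plain difference-of-means score stands in for the SAM d-statistic, a single symmetric cutoff replaces t1/t2, and the data are invented):

```python
import random

def score(g, labels):
    """Plain difference of class means; an illustrative stand-in for
    the SAM d-statistic."""
    a = [x for x, l in zip(g, labels) if l == 0]
    b = [x for x, l in zip(g, labels) if l == 1]
    return sum(b) / len(b) - sum(a) / len(a)

def estimate_fdr(data, labels, cutoff, n_perm=200, seed=0):
    """Permutation FDR estimate: (average number of genes past the
    cutoff under label permutations) / (number called significant in
    the real data)."""
    rng = random.Random(seed)
    called = sum(1 for g in data if abs(score(g, labels)) >= cutoff)
    if called == 0:
        return 0.0
    total_null = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)              # H0: labels are exchangeable
        total_null += sum(1 for g in data if abs(score(g, perm)) >= cutoff)
    return (total_null / n_perm) / called

# Invented data: 5 strongly differential genes, 15 flat genes.
data = ([[0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]] * 5
        + [[0.5, -0.2, 0.1, -0.4, 0.3, -0.1, 0.2, -0.3]] * 15)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(estimate_fdr(data, labels, cutoff=5.0))
```

Note that permuting the labels of a strongly differential gene can still yield large scores, so this simple estimate is conservative at loose cutoffs.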
FWER or FDR? Choose control of the FWER (less powerful, but more conservative) if high confidence in all selected genes is desired; the cost is a loss of power due to the large number of tests: many truly differentially expressed genes may not appear significant. If a certain proportion of false positives is tolerable, procedures based on the FDR (less conservative, but more powerful) are more flexible: the researcher can decide how many genes to select based on practical considerations.
Gene expression data normalization and pre-processing (microarrays and RNA-seq)
Different types of microarrays • By number of samples hybridized: single-channel vs. dual-channel • By method of generating probes: cDNA library vs. synthesized oligonucleotides
The microarray analysis pipeline: biological question (differentially expressed genes, sample class prediction, etc.) → experimental design → microarray experiment → image analysis → normalization (yielding R, G intensities) → estimation / testing / clustering / discrimination → biological verification and interpretation
Scanning a hybridized microarray (two-color arrays): excitation by two lasers, with the emission from each captured as a monochrome picture and the two pictures then combined
Image segmentation: segmentation software converts the scanned image into numerical data
Why normalize? • Microarray data have significant systematic variation, both within arrays and between arrays, that is not true biological variation • Accurate comparison of genes' relative expression within and across conditions requires removing these systematic effects • Sources of variation: • Spatial location on the array • Dye biases, which vary with spot intensity • Plate origin • Printing/spotting quality • Experimenter
Why is normalization important? Experiment: Comparison of gene expression response in mouse heart and kidney in response to drug Most biological effects are swamped by systematic effects! Source: http://www.partek.com
Log ratios treat up- and down-regulated genes equally (two-color arrays): log2(1) = 0, log2(2) = 1, log2(1/2) = -1
A note about Affymetrix (1-color) pre-processing: the typical raw probe intensity distribution is heavily right-skewed; a log transform makes it far more symmetric
Normalization overview • Within-array • Correct systematic differences among probes within the same chip (spatial, intensity, …) • Between-array • Correct global biases across multiple arrays, so that comparisons can be made
Within-array normalization: Spatial biases Solution: spatial background estimation/subtraction
Intensity-dependent normalization (Yang, Speed): LOWESS, a local linear fit • Compensates for intensity-dependent biases
Detecting intensity-dependent biases: M vs A plots (also called R-I plots) • X axis: A, the average intensity: A = 0.5·log2(Cy3·Cy5) • Y axis: M, the log ratio: M = log2(Cy3/Cy5)
In the M vs A plot, M > 0 means Cy3 > Cy5 and M < 0 means Cy3 < Cy5; low intensities lie on the left and high intensities on the right. A trend in M as a function of A indicates intensity-dependent bias.
With no intensity-dependent bias, we expect the M vs A plot to scatter symmetrically around M = 0 across the whole range of A.
LOWESS (Locally Weighted Scatterplot Smoothing) estimates M = log2(Cy3/Cy5) as a function of A • Local linear regression model • Tri-cube weight function • Weighted least squares The fitted trend is then subtracted from M.
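A bare-bones version of this pipeline, written from the bullet points above (one-pass LOWESS with tri-cube weights and weighted least squares, no robustifying iterations; M and A follow the slide's conventions; an illustrative sketch, not production code):

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Minimal LOWESS: for each point, fit a weighted least-squares
    line over its nearest neighbours using tri-cube weights, and
    return the fitted value at that point."""
    n = len(x)
    k = max(2, int(frac * n))                 # neighbourhood size
    fitted = []
    for i in range(n):
        dists = sorted(abs(x[j] - x[i]) for j in range(n))
        h = dists[k - 1] or 1e-12             # local bandwidth
        w = [max(0.0, 1 - (abs(xj - x[i]) / h) ** 3) ** 3 for xj in x]
        # Weighted least squares for a line y = a + b*x
        sw = sum(w)
        swx = sum(wj * xj for wj, xj in zip(w, x))
        swy = sum(wj * yj for wj, yj in zip(w, y))
        swxx = sum(wj * xj * xj for wj, xj in zip(w, x))
        swxy = sum(wj * xj * yj for wj, xj, yj in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)           # degenerate: weighted mean
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            fitted.append(a + b * x[i])
    return fitted

def normalize_ma(cy3, cy5):
    """Compute M and A, estimate the intensity-dependent trend in M
    with LOWESS, and subtract it so corrected ratios centre on 0."""
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]   # M = log2(Cy3/Cy5)
    trend = lowess_fit(a, m)
    return [mi - ti for mi, ti in zip(m, trend)]

# Invented example: a constant 1.5x dye bias should be removed almost exactly.
cy3 = [2.0 ** (1 + 0.1 * i) for i in range(40)]
cy5 = [1.5 * v for v in cy3]
corrected = normalize_ma(cy3, cy5)
print(max(abs(v) for v in corrected))   # ~0
```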
Other strategies for between-array normalization • Global (linear) scaling • Quantile normalization
Global linear normalization • A linear normalization (scale and/or shift) is computed to balance chips/channels: Xi_norm = k·Xi (1-color), or log2(R/G) → log2(R/G) - c (2-color) • Equalizes the mean (or median) intensity/ratio and variance among the compared chips • Assumption: the total RNA (mass) used is the same for both samples; averaged across thousands of genes, total hybridization should then be the same for both samples
Global normalization (2-color): histogram of log-ratios before (un-normalized) and after (normalized): log2(R/G) → log2(R/G) - c, where e.g. c = log2(∑Ri / ∑Gi) or c = (1/n)·∑ log2(Ri/Gi)
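As a sketch (illustrative, with invented intensities), the constant-shift version using the total-intensity constant c from the slide is just:

```python
import math

def global_normalize(red, green):
    """Shift every log2(R/G) by a single constant c so the ratios
    centre on zero; here c = log2(sum(R) / sum(G)), the
    total-intensity version from the slide."""
    c = math.log2(sum(red) / sum(green))
    return [math.log2(r / g) - c for r, g in zip(red, green)]

red = [200.0, 400.0, 600.0]     # invented channel intensities
green = [100.0, 200.0, 300.0]
print(global_normalize(red, green))   # → [0.0, 0.0, 0.0]
```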
Quantile Normalization • One of the most widely used methods (implemented in many standard tools) • Ignores the causes of variation and any technical covariates • Shoehorns all data into the same shape of distribution by matching quantiles
Motivation: probe intensity distribution across 23 Affy replicates (black is global)
Quantile normalization (Irizarry et al. 2002): let F1 be the cumulative distribution function of an array's probe intensities and F2 the CDF of the reference distribution; each value x is mapped to xnorm = F2^-1(F1(x)). Assumes that the underlying gene-expression distribution changes little between arrays.
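In practice the procedure reduces to sorting each array, averaging across arrays rank-by-rank to build the reference, and mapping each value back to the reference value at its own rank; a minimal sketch (my illustration, with ties broken arbitrarily, which real implementations handle more carefully):

```python
def quantile_normalize(arrays):
    """Quantile normalization: sort each array, average across arrays
    at each rank to form the reference distribution, then replace
    every value by the reference value at its own rank."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    reference = [sum(col[i] for col in sorted_cols) / len(arrays)
                 for i in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        norm = [0.0] * n
        for rank_i, idx in enumerate(order):
            norm[idx] = reference[rank_i]    # map back by rank
        out.append(norm)
    return out

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]   # two invented arrays
print(quantile_normalize(arrays))
# → [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization the two arrays have exactly the same set of values, differing only in which probe carries which value.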
Quantile normalization applied to 23 Affy replicates (black is reference distribution)
Which genes/probes should be used for between-array normalization? • All genes on the chip • Housekeeping genes (pre-defined) • “Invariant” genes (estimated from the data) • Spiked-in controls (each method has advantages/disadvantages in different settings)
Normalization tools • Normalization is typically provided by the microarray vendor's software or a core facility, but you should always understand the data you're working with: How has your data been processed? Are there any lingering effects? • Bioconductor (both Affymetrix and two-color): many packages implemented in the R language: affy, oligo, limma • dChip (Affymetrix): quantile, invariant-set • MAANOVA: microarray ANOVA analysis • For Affymetrix arrays specifically: MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes); RMA by the Speed group (UC Berkeley) (ignores MM probes)
General rules of thumb for dealing with systematic biases in arrays • Keep track of all possible labels (batch, plate, experimenter, date)! • Careful experimental design (randomization goes a long way!) • Replicates (both biological and technical) • Investigate for possible systematic effects: check global intensity and log-ratio distributions across the arrays in your dataset; ANOVA (Analysis of Variance) including the various experimental factors • Correct obvious effects where possible: generalized linear models, physical error models
Summary for normalization • Systematic biases exist in microarray data • Normalization can help to remove these biases, to leave the biology behind • Within-array vs. between-array normalization methods • Design experiments to minimize biases in the first place!