CSCI 5461: Functional Genomics, Systems Biology and Bioinformatics (Spring 2012). Lecture 8: Gene expression analysis / hypothesis testing. Chad Myers, Department of Computer Science and Engineering, University of Minnesota, cmyers@cs.umn.edu
Outline for today • Multiple hypothesis testing problem and corrections • Overview of data normalization/processing for microarrays/RNA seq
Wilcoxon Rank-sum test • Uses only the ranks of the data • Good for non-normal data • Identifies genes whose ranks are skewed toward one class • Because only ranks matter, it is less sensitive to outliers • But it also loses power when the sample size is small
Wilcoxon Rank-sum test • Procedure: • Merge all the observations from the two classes and rank them in ascending order • Calculate the Wilcoxon statistic: the sum of the ranks of the observations from the smaller class • Find the p-value in the Wilcoxon rank-sum distribution table (order statistics; approximately normal for larger samples)
Wilcoxon Rank-sum test • Calculate the Wilcoxon statistic: the sum of the ranks of the observations from the smaller class W = 1 + 2 + 3 + 5 + 6 + 8 + 10 = 35
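The procedure above can be sketched in a few lines of pure Python (a minimal illustration, not the lecture's code; the sample vectors in the usage line are invented, and the p-value uses the normal approximation without a continuity correction):

```python
import math

def rank(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0            # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def ranksum_test(x, y):
    """Wilcoxon rank-sum: W = sum of the ranks of the smaller class,
    with a two-sided normal approximation for the p-value."""
    if len(x) > len(y):
        x, y = y, x                          # make x the smaller class
    ranks = rank(list(x) + list(y))
    w = sum(ranks[:len(x)])
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2.0            # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail of N(0,1)
    return w, p

# Hypothetical expression values for one gene in two classes:
print(ranksum_test([1, 2, 3], [4, 5, 6, 7]))   # W = 6.0, p ~ 0.034
```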
Parametric vs. non-parametric tests: advantages/disadvantages?
Non-parametric vs. parametric • Advantages: • No distributional assumptions about the data, so the tests always apply • If used properly, more conservative (no risk of violating assumptions) • Disadvantages: • Depending on the test, you can lose power by not leveraging properties of the data (e.g. rank-sum ignores the magnitude of observations and uses only their ordering) • Can be computationally intensive (permutations/resampling to estimate the null) • When the null distribution is estimated by resampling, the resolution of the significance estimates is limited by the number of permutations (e.g. 1,000 permutations cannot yield a p-value below 1/1,000)
Example: Cancer Classification • Golub data • 3051 genes • 27 acute lymphoblastic leukemia (ALL) • 11 acute myeloid leukemia (AML)
Multiple hypothesis testing problem Assume we choose to reject the null hypothesis when p-value < 0.05, and assume the data actually fit the null distribution. How many false positives do we expect if we test 10,000 hypotheses? On average, 0.05 × 10,000 = 500 tests will be rejected purely by chance.
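This expectation is easy to check numerically; a small sketch (my own illustration, not from the lecture) draws 10,000 null p-values, which are Uniform(0, 1) when the null holds, and counts how many fall below the threshold:

```python
import random

random.seed(0)
m, alpha = 10000, 0.05
# Under the null, p-values are Uniform(0, 1), so each test "rejects"
# with probability alpha even though no real effect exists.
false_positives = sum(1 for _ in range(m) if random.random() < alpha)
print(false_positives)   # close to alpha * m = 500
```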
Multiple Testing Problem • Simultaneously test G null hypotheses, one for each gene j • Because gene expression experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue • Would like some sense of how ‘surprising’ the observed results are
Multiple Testing Problem • Assume simultaneously monitoring expression levels of N genes • Assume a global significance level α • Can we find an adjusted per-gene significance level α’ to control the global probability of any mistake? • Family-wise error rate (FWER).
Control of the FWER • Šidák correction (assumes independence): α' = 1 - (1 - α)^(1/N) • Bonferroni correction: α' = α/N, an upper bound on the probability of at least one mistake (Boole's inequality) • FWER control often yields very stringent adjustments
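Both adjustments are one-liners; the sketch below (an illustration under the slide's assumptions, not lecture code) shows the per-test levels, where the Šidák version is valid only for independent tests:

```python
def sidak_alpha(alpha, n):
    """Per-test level so that, for n independent tests,
    P(at least one false positive) equals alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

def bonferroni_alpha(alpha, n):
    """Per-test level alpha/n; by Boole's inequality this bounds the
    FWER by alpha with no independence assumption."""
    return alpha / n

# For 10,000 tests at a global alpha of 0.05:
print(sidak_alpha(0.05, 10000))       # ~5.13e-06
print(bonferroni_alpha(0.05, 10000))  # 5e-06, slightly stricter
```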
Control of the FWER • Holm's step-wise (step-down) correction • Rank genes by their p-values in ascending order • Compare the p-value p_i of the i-th ranked gene with the threshold α/(N - i + 1) • Let i* be the first rank at which p_i exceeds its threshold; reject the null hypothesis for genes 1, …, i* - 1 • The threshold on the p-value grows as the rank increases: after each rejection, one fewer test remains, so the next Bonferroni correction is of size N - i + 1
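Holm's procedure can be sketched as follows (an illustrative implementation, not the lecture's code; ties are handled naively and comparisons use `<=`):

```python
def holm_reject(pvals, alpha):
    """Holm step-down: sort p-values ascending, compare the i-th
    smallest (1-based) to alpha / (N - i + 1); stop at the first
    failure and reject everything before it."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    rejected = [False] * n
    for rank_i, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (n - rank_i + 1):
            rejected[idx] = True
        else:
            break                      # all larger p-values also fail
    return rejected

print(holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05))
# → [True, False, False, True]
```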
What about dependence among genes? Permutation-based approaches • Permute the class labels but keep the dependence structure among the genes' expression values • Take into account the joint distribution of the test statistics • Usually less conservative than Bonferroni-adjusted p-values • Computationally intensive; the full set of permutations may be too large to enumerate, so sampling is often needed
Westfall and Young minP permutation-based approach • Step-down minP adjusted p-values (Westfall & Young, 1993) • Permute the class labels of the samples • For each permutation b and each gene i, calculate a p-value; the step-down adjustment is applied over the successive minima of these permutation p-values • The resulting corrected p-values maintain the dependence structure among the genes' expression values
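The flavour of the Westfall-Young approach can be illustrated with the closely related single-step maxT variant, which is simpler to write down: permute the labels jointly for all genes (preserving their dependence) and compare each observed |t| to the permutation distribution of the maximum |t|. This is my own hedged sketch with invented toy data, not the lecture's minP code:

```python
import math
import random

def tstat(a, b):
    """Two-sample t statistic with unpooled (Welch-style) variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def maxT_adjusted(data, labels, n_perm=1000, seed=0):
    """Single-step maxT: adjusted p_i = fraction of label permutations
    in which max_j |t_j| meets or exceeds the observed |t_i|.  The
    same permutation is applied to every gene, so the genes'
    correlation structure is preserved."""
    rng = random.Random(seed)
    split = lambda g, lab: ([x for x, l in zip(g, lab) if l == 0],
                            [x for x, l in zip(g, lab) if l == 1])
    obs = [abs(tstat(*split(g, labels))) for g in data]
    exceed = [0] * len(data)
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)                   # one shared permutation
        tmax = max(abs(tstat(*split(g, perm))) for g in data)
        for i, o in enumerate(obs):
            if tmax >= o:
                exceed[i] += 1
    return [e / n_perm for e in exceed]

# Invented toy data: gene 0 differs strongly between classes, gene 1 not.
data = [[1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2],
        [1.0, 2.0, 1.5, 1.2, 1.1, 1.9, 1.4, 1.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(maxT_adjusted(data, labels))
```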
Multiple hypothesis test correction alternatives Let m be the number of tests, V the number of type I errors (false positives), and R the total number of rejected hypotheses. • Per-comparison error rate: expected proportion of type I errors (no correction): PCER = E(V/m) = α • Family-wise error rate: probability of at least one type I error (Bonferroni, Holm's step-down, …): FWER = P(V > 0) • False discovery rate: expected proportion of false positives among the rejected hypotheses: FDR = E(V/R)
Control of the FDR: Benjamini and Hochberg (1995) “step-up” procedure • Rank genes by their p-values in ascending order • Find the maximum i such that the p-value p_i of the i-th ranked gene satisfies p_i ≤ (i/m)·α • Reject the null hypothesis for genes 1, …, i (assumes the genes are independent) Dependent case: Benjamini and Yekutieli (2001)
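A compact sketch of the step-up procedure (illustrative only, not lecture code; it takes the p-values as given and returns a reject/accept flag per gene):

```python
def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: find the largest i (1-based over the
    ascending-sorted p-values) with p_(i) <= (i/m)*alpha and reject
    hypotheses 1..i; controls the FDR at alpha for independent tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for i, idx in enumerate(order, start=1):
        if pvals[idx] <= i * alpha / m:
            k = i                      # keep the largest passing rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(bh_reject(pvals, alpha=0.05))    # rejects only the two smallest
```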
Newer alternatives for controlling the FDR pFDR approach, including versions with no independence assumption: Storey, J. D. & Tibshirani, R. (2001). Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical Report 2001-28, Department of Statistics, Stanford University. Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64: 479-498. Others: Efron, B. & Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23: 70-86. Tusher et al. (SAM). Nice review: Ge, Dudoit, and Speed. Resampling-based multiple testing for microarray data analysis. Technical Report #633, 2003.
Example: permutation approach to estimate the FDR (Tusher et al., SAM) • Choose cutoffs t1 and t2 on the statistic • Calculate the average number of genes that exceed these cutoffs in the permutations • This average estimates the number of falsely significant genes under H0 • Divide by the number of genes called significant in the real data to estimate the FDR
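The same idea in miniature (an illustrative sketch, not SAM itself: a plain difference-of-means score stands in for the SAM d-statistic, a single symmetric cutoff replaces t1/t2, and the data are invented):

```python
import random

def score(g, labels):
    """Plain difference of class means; an illustrative stand-in for
    the SAM d-statistic."""
    a = [x for x, l in zip(g, labels) if l == 0]
    b = [x for x, l in zip(g, labels) if l == 1]
    return sum(b) / len(b) - sum(a) / len(a)

def estimate_fdr(data, labels, cutoff, n_perm=200, seed=0):
    """Permutation FDR estimate: (average number of genes past the
    cutoff under label permutations) / (number called significant in
    the real data)."""
    rng = random.Random(seed)
    called = sum(1 for g in data if abs(score(g, labels)) >= cutoff)
    if called == 0:
        return 0.0
    total_null = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)              # H0: labels are exchangeable
        total_null += sum(1 for g in data if abs(score(g, perm)) >= cutoff)
    return (total_null / n_perm) / called

# Invented data: 5 strongly differential genes, 15 flat genes.
data = ([[0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]] * 5
        + [[0.5, -0.2, 0.1, -0.4, 0.3, -0.1, 0.2, -0.3]] * 15)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(estimate_fdr(data, labels, cutoff=5.0))
```

Note that permuting the labels of a strongly differential gene can still yield large scores, so this simple estimate is conservative at loose cutoffs.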
FWER or FDR? Choose control of the FWER (less powerful, but more conservative) if high confidence in all selected genes is desired; the cost is a loss of power due to the large number of tests: many truly differentially expressed genes may not appear significant. If a certain proportion of false positives is tolerable, procedures based on the FDR (less conservative, but more powerful) are more flexible: the researcher can decide how many genes to select based on practical considerations.
Gene expression data normalization and pre-processing (microarrays and RNA-seq)
Different types of microarrays • By number of samples hybridized: single-channel vs. dual-channel • By method of generating probes: cDNA library vs. synthesized oligonucleotides
The microarray analysis pipeline: biological question (differentially expressed genes, sample class prediction, etc.) → experimental design → microarray experiment → image analysis → normalization (yielding R, G intensities) → estimation / testing / clustering / discrimination → biological verification and interpretation
Scanning a hybridized microarray (two-color arrays): excitation by two lasers, with the emission from each captured as a monochrome picture and the two pictures then combined
Image segmentation: segmentation software converts the scanned image into numerical data
Why normalize? • Microarray data have significant systematic variation, both within arrays and between arrays, that is not true biological variation • Accurate comparison of genes' relative expression within and across conditions requires removing these systematic effects • Sources of variation: • Spatial location on the array • Dye biases, which vary with spot intensity • Plate origin • Printing/spotting quality • Experimenter
Why is normalization important? Experiment: Comparison of gene expression response in mouse heart and kidney in response to drug Most biological effects are swamped by systematic effects! Source: http://www.partek.com
Log ratios treat up- and down-regulated genes equally (two-color arrays): log2(1) = 0, log2(2) = 1, log2(1/2) = -1
A note about Affymetrix (1-color) pre-processing: the typical raw probe intensity distribution is heavily right-skewed; a log transform makes it far more symmetric
Normalization overview • Within-array • Correct systematic differences among probes within the same chip (spatial, intensity, …) • Between-array • Correct global biases across multiple arrays, so that comparisons can be made
Within-array normalization: Spatial biases Solution: spatial background estimation/subtraction
Intensity-dependent normalization (Yang, Speed): LOWESS, a local linear fit • Compensates for intensity-dependent biases
Detecting intensity-dependent biases: M vs A plots (also called R-I plots) • X axis: A, the average intensity: A = 0.5·log2(Cy3·Cy5) • Y axis: M, the log ratio: M = log2(Cy3/Cy5)
In the M vs A plot, M > 0 means Cy3 > Cy5 and M < 0 means Cy3 < Cy5; low intensities lie on the left and high intensities on the right. A trend in M as a function of A indicates intensity-dependent bias.
With no intensity-dependent bias, we expect the M vs A plot to scatter symmetrically around M = 0 across the whole range of A.
LOWESS (Locally Weighted Scatterplot Smoothing) estimates M = log2(Cy3/Cy5) as a function of A • Local linear regression model • Tri-cube weight function • Weighted least squares The fitted trend is then subtracted from M.
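A bare-bones version of this pipeline, written from the bullet points above (one-pass LOWESS with tri-cube weights and weighted least squares, no robustifying iterations; M and A follow the slide's conventions; an illustrative sketch, not production code):

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Minimal LOWESS: for each point, fit a weighted least-squares
    line over its nearest neighbours using tri-cube weights, and
    return the fitted value at that point."""
    n = len(x)
    k = max(2, int(frac * n))                 # neighbourhood size
    fitted = []
    for i in range(n):
        dists = sorted(abs(x[j] - x[i]) for j in range(n))
        h = dists[k - 1] or 1e-12             # local bandwidth
        w = [max(0.0, 1 - (abs(xj - x[i]) / h) ** 3) ** 3 for xj in x]
        # Weighted least squares for a line y = a + b*x
        sw = sum(w)
        swx = sum(wj * xj for wj, xj in zip(w, x))
        swy = sum(wj * yj for wj, yj in zip(w, y))
        swxx = sum(wj * xj * xj for wj, xj in zip(w, x))
        swxy = sum(wj * xj * yj for wj, xj, yj in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)           # degenerate: weighted mean
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            fitted.append(a + b * x[i])
    return fitted

def normalize_ma(cy3, cy5):
    """Compute M and A, estimate the intensity-dependent trend in M
    with LOWESS, and subtract it so corrected ratios centre on 0."""
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]   # M = log2(Cy3/Cy5)
    trend = lowess_fit(a, m)
    return [mi - ti for mi, ti in zip(m, trend)]

# Invented example: a constant 1.5x dye bias should be removed almost exactly.
cy3 = [2.0 ** (1 + 0.1 * i) for i in range(40)]
cy5 = [1.5 * v for v in cy3]
corrected = normalize_ma(cy3, cy5)
print(max(abs(v) for v in corrected))   # ~0
```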
Other strategies for between-array normalization • Global (linear) scaling • Quantile normalization
Global linear normalization • A linear normalization (scale and/or shift) is computed to balance chips/channels: Xi_norm = k·Xi (1-color), or log2(R/G) → log2(R/G) - c (2-color) • Equalizes the mean (or median) intensity/ratio and variance among the compared chips • Assumption: the total RNA (mass) used is the same for both samples; averaged across thousands of genes, total hybridization should then be the same for both samples
Global normalization (2-color): histogram of log-ratios before (un-normalized) and after (normalized): log2(R/G) → log2(R/G) - c, where e.g. c = log2(∑Ri / ∑Gi) or c = (1/n)·∑ log2(Ri/Gi)
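As a sketch (illustrative, with invented intensities), the constant-shift version using the total-intensity constant c from the slide is just:

```python
import math

def global_normalize(red, green):
    """Shift every log2(R/G) by a single constant c so the ratios
    centre on zero; here c = log2(sum(R) / sum(G)), the
    total-intensity version from the slide."""
    c = math.log2(sum(red) / sum(green))
    return [math.log2(r / g) - c for r, g in zip(red, green)]

red = [200.0, 400.0, 600.0]     # invented channel intensities
green = [100.0, 200.0, 300.0]
print(global_normalize(red, green))   # → [0.0, 0.0, 0.0]
```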
Quantile Normalization • One of the most widely used methods (implemented in many standard tools) • Ignores the causes of variation and any technical covariates • Shoehorns all data into the same shape of distribution by matching quantiles
Motivation: probe intensity distribution across 23 Affy replicates (black is global)
Quantile normalization (Irizarry et al. 2002): let F1 be the cumulative distribution function of an array's probe intensities and F2 the CDF of the reference distribution; each value x is mapped to xnorm = F2^-1(F1(x)). Assumes that the underlying gene-expression distribution changes little between arrays.
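In practice the procedure reduces to sorting each array, averaging across arrays rank-by-rank to build the reference, and mapping each value back to the reference value at its own rank; a minimal sketch (my illustration, with ties broken arbitrarily, which real implementations handle more carefully):

```python
def quantile_normalize(arrays):
    """Quantile normalization: sort each array, average across arrays
    at each rank to form the reference distribution, then replace
    every value by the reference value at its own rank."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    reference = [sum(col[i] for col in sorted_cols) / len(arrays)
                 for i in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        norm = [0.0] * n
        for rank_i, idx in enumerate(order):
            norm[idx] = reference[rank_i]    # map back by rank
        out.append(norm)
    return out

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]   # two invented arrays
print(quantile_normalize(arrays))
# → [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization the two arrays have exactly the same set of values, differing only in which probe carries which value.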
Quantile normalization applied to 23 Affy replicates (black is reference distribution)
Which genes/probes should be used for between-array normalization? • All genes on the chip • Housekeeping genes (pre-defined) • “Invariant” genes (estimated from the data) • Spiked-in controls (each method has advantages/disadvantages in different settings)
Normalization tools • Normalization is typically provided by the microarray vendor's software or a core facility, but you should always understand the data you're working with: How has your data been processed? Are there any lingering effects? • Bioconductor (both Affymetrix and two-color): many packages implemented in the R language: affy, oligo, limma • dChip (Affymetrix): quantile, invariant-set • MAANOVA: microarray ANOVA analysis • For Affymetrix arrays specifically: MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes); RMA by the Speed group (UC Berkeley) (ignores MM probes)
General rules of thumb for dealing with systematic biases in arrays • Keep track of all possible labels (batch, plate, experimenter, date)! • Careful experimental design (randomization goes a long way!) • Replicates (both biological and technical) • Investigate for possible systematic effects: check global intensity and log-ratio distributions across the arrays in your dataset; ANOVA (Analysis of Variance) including the various experimental factors • Correct obvious effects where possible: generalized linear models, physical error models
Summary for normalization • Systematic biases exist in microarray data • Normalization can help to remove these biases, to leave the biology behind • Within-array vs. between-array normalization methods • Design experiments to minimize biases in the first place!