Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Module 3: Statistical Analysis Paul Boutros Microarray Data Analysis June 24-25, 2013

Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine Learning

What Are The Outputs of A Microarray Study? Primary Data Raw image (.DAT file) Quantitation (.CEL file) Secondary Data Normalized data (usually an ASCII text file) QA/QC plots Tertiary Data Statistical analyses Global visualization (e.g. heatmaps) Downstream analyses (e.g. pathway, dataset-integration) These files can be tens of GB for a typical Affy study

How Do You Organize These Data? I recommend you put things on a fast, backed-up network drive /data/ Organize data by project /data/Project Create separate directories for each analysis /data/Project/raw /data/Project/QAQC /data/Project/pre-processing /data/Project/statistical /data/Project/pathway
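The layout above could be created in one step from R. This is a minimal sketch, assuming a project simply called "Project" and using a temporary directory as a stand-in for the real /data/ drive so that it runs anywhere:

```r
# Sketch: build the per-project directory layout described above.
# 'data.root' stands in for the real network drive (/data/); tempdir() is an
# assumption made only so the example can run on any machine.
data.root <- file.path(tempdir(), "data");
project <- "Project";   # placeholder project name

analyses <- c("raw", "QAQC", "pre-processing", "statistical", "pathway");
project.dirs <- file.path(data.root, project, analyses);

for (d in project.dirs) {
    # recursive = TRUE also creates the parent directories if needed
    dir.create(d, recursive = TRUE, showWarnings = FALSE);
    }

# confirm that every directory now exists
created <- dir.exists(project.dirs);
print(data.frame(directory = project.dirs, exists = created));
```

Keeping the list of analyses in one vector makes it easy to reuse the same structure for the next project.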

How Do You Organize The Scripts? I recommend you write a separate script for each analysis, and put those in a standardized (backed-up!) location, mirroring the directory structure and naming of your dataset directories. Some sub-structure here is often useful: /scripts/Project/pre-processing.R /scripts/Project/statistical-univariate.R /scripts/Project/statistical-multivariate.R /scripts/Project/pathway/GOMiner.R /scripts/Project/pathway/Reactome.R /scripts/Project/integration/mRNA+CNV.R /scripts/Project/integration/public-data.R

Why Many Small Scripts? Monolithic scripts are hard to maintain Easier to make errors Accidentally re-using the same variable name Harder to debug Harder for somebody else to learn Small scripts are more flexible Quicker to modify/re-run a small part of your analysis Easier to re-use the same code on another dataset This is akin to the “unix” mindset of systems design

What To Save? Everything!! All QA/QC plots (common reviewer request) All pre-processed data (needed for GEO uploads) Gene-wise statistical analyses Not just the statistically-significant genes Collapse all analyses into one file, though All plots/etc Using clear filenames is critical Disk-space is not usually a critical concern here Your raw data will be much larger than your output!

Most Important Points Do not delete things: Keep all old versions of your scripts by including the date in the filename (or using source-control) Version output files by date I have needed to go back to analyses done 7 years prior! Make regular (weekly) backups: Try to pass this work off to professional sysadmins External hard-drives/USBs are okay if you cannot get access to network drives, but try to automate
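One possible way to version output files by date is to build the filename from the date. The helper name and the filename pattern below are assumptions for illustration, not a prescribed convention:

```r
# Sketch: prefix output filenames with the date so old versions are never overwritten.
# 'dated.filename' is a hypothetical helper, not part of any package.
dated.filename <- function(description, extension, date = Sys.Date()) {
    paste0(format(date, "%Y-%m-%d"), "_", description, ".", extension);
    }

# example with a fixed date so the result is reproducible
example.name <- dated.filename(
    "Project_statistical-univariate",
    "txt",
    as.Date("2013-06-24")
    );
print(example.name);
```

Putting the date first in ISO order (year-month-day) means an alphabetical directory listing is also chronological.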

Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

Not All Experimental Designs Are Simple: Alternative Questions Are a large number of groups different? Do two things synergize? Are mRNA levels correlated to something? Are mRNA levels associated with survival? Can we use mRNA levels to predict things?

General Linear Modeling The underlying mathematical framework for most statistical techniques we are familiar with: ANOVAs Logistic regression Linear regression Multiple regression Y = a0 + a1x1+ a2x2 + … NOT the same as a “Generalized Linear Model”!!!

General Linear Modeling: Special Cases Linear Regression: Y = a0 + a1x1, x1 continuous Logistic Regression: Y = a0 + a1x1, Y factorial (strictly a generalized linear model: the linear predictor describes the log-odds of Y) Multiple Regression: Y = a0 + a1x1 + a2x2, x1 and x2 continuous

ANOVAs 1-way ANOVA: Y = a0 + a1x1, x1 factorial 2-way ANOVA: Y = a0 + a1x1 + a2x2 + a3x1x2, x1 and x2 two-level factors

ANOVA Experimental Designs Are Common Classic one-way ANOVAs: Treat a cell-line with 5 drugs: do any of them make a difference? Make 5 different genetic mutations: do any of them alter gene-expression? H0: all group means are equal (alternative: the mean of at least one group differs) Guesses at the assumptions?

Assumptions Are Similar to T-test Normal distribution for the dependent variable Samples are independent Homoscedasticity Independent variables are: Not correlated Random normal variables
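Two of these assumptions can be checked informally in R. The sketch below uses simulated data for a single gene; the choice of the Shapiro-Wilk and Bartlett tests is an assumption made for illustration, and other diagnostics (e.g. residual plots) may be preferable:

```r
# Sketch on simulated data: informal checks of two ANOVA assumptions for one gene.
set.seed(5);
groups <- factor(rep(c("A", "B", "C"), each = 8));
expression <- rnorm(24, mean = 8, sd = 0.5);   # made-up expression values

fit <- aov(expression ~ groups);

# normality of the residuals (Shapiro-Wilk test)
normality.p <- shapiro.test(residuals(fit))$p.value;

# homoscedasticity: equal variance across groups (Bartlett's test)
variance.p <- bartlett.test(expression ~ groups)$p.value;

print(c(normality = normality.p, equal.variance = variance.p));
```

Small p-values here would suggest that the corresponding assumption is violated for that gene.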

1-Way ANOVAs in R (Part 1)
# load the affy package, which provides ReadAffy() and expresso()
library(affy);
# read the data normally
raw.data <- ReadAffy();
eset <- expresso(…);
# localize for readability
expression.matrix <- exprs(eset);
# have a list of groups
groups <- as.factor( pData(eset)$x );

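1-Way ANOVAs in R (Part 2)
# pre-allocate one p-value per gene
pvalues <- rep(NA, nrow(expression.matrix));
# loop over each gene
for (i in 1:nrow(expression.matrix)) {
    # fit a one-way anova against the group labels from Part 1
    tmp <- aov(expression.matrix[i,] ~ groups);
    # extract the p-value and store it (rather than overwriting it each iteration)
    pvalues[i] <- summary(tmp)[[1]][1,5];
    }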

But This Is Limited A 1-way ANOVA just says that at least one group differs Which one? Use post hoc tests No microarray-specific aspects here Note: connection to multiple-testing
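A post hoc test and a multiple-testing adjustment might look like the sketch below. The data are simulated, Tukey's HSD is only one of several possible post hoc tests, and the extra p-values are made-up placeholders standing in for other genes:

```r
# Sketch on simulated data: a one-way ANOVA followed by a post hoc test.
set.seed(1);
groups <- factor(rep(c("control", "drugA", "drugB"), each = 5));
# simulate one gene in which only drugB shifts expression
expression <- rnorm(15, mean = 8, sd = 0.3) + ifelse(groups == "drugB", 2, 0);

fit <- aov(expression ~ groups);
anova.p <- summary(fit)[[1]][1, 5];   # omnibus p-value: "some group differs"

# Tukey's honest significant differences: which pairs of groups differ?
posthoc <- TukeyHSD(fit);
print(posthoc$groups);

# across many genes, the omnibus p-values still need multiple-testing adjustment
pvalues <- c(anova.p, 0.04, 0.20, 0.65);   # last three are made-up placeholders
qvalues <- p.adjust(pvalues, method = "fdr");
print(qvalues);
```

The post hoc table has one row per pair of groups, so the number of comparisons grows quickly as groups are added.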

Sometimes 1-Way ANOVAs Are Not Worth the Effort [Diagram: Wildtype, Mutation 1, Mutation 2] 1-way ANOVA + post hoc, or 2 t-tests?

Not Always Testing Raw Data 3 drugs with different controls [Diagram: Vehicle 1 and Drug 1, Vehicle 2 and Drug 2, Vehicle 3 and Drug 3] 1-way ANOVA on the fold-changes

Two-Way ANOVAs Probably even more common than one-way ANOVAs Very powerful: Synergy? Additivity? Antagonism? Y = a0 + a1x1 + a2x2 + a3x1x2 Assumptions?

Assumptions Are Similar to 1-Way ANOVA Normal distribution for the dependent variable Samples are independent Homoscedasticity Independent variables are: Not correlated Random normal variables

Do these treatments interact? Standard approach: ANOVA [Figure: interaction between Treatment #1 and Treatment #2]

Example: Radiation Toxicity Some people are prone to late-stage radio-toxicity Does radiation induce specific patterns of gene-expression in these people? [Design: Radiation (0 Gy or 3 Gy) x Radio-Sensitive]

Solution Fit an ANOVA model to each gene Most effects are due to radiation alone, minimal interaction

Two-Way ANOVAs in R The limma package is one very good approach for this Alternatively standard model-fitting using the lm() function can be done for each gene We will cover each approach in Tutorial #3
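For a single gene, the lm() route could be sketched as follows. The data are simulated for a 2 x 2 design loosely modelled on the radiation example, and the factor names are assumptions; in practice the same fit would be repeated for every row of the expression matrix:

```r
# Sketch on simulated data for a single gene in a 2 x 2 design.
set.seed(2);
design <- expand.grid(
    replicate = 1:4,
    radiation = c("0Gy", "3Gy"),
    sensitive = c("no", "yes")
    );
# simulate a radiation main effect with no interaction
design$expression <- 7 + 1.5 * (design$radiation == "3Gy") + rnorm(nrow(design), sd = 0.2);

# radiation * sensitive expands to both main effects plus their interaction,
# i.e. Y = a0 + a1x1 + a2x2 + a3x1x2
fit <- lm(expression ~ radiation * sensitive, data = design);
anova.table <- anova(fit);
print(anova.table);

# p-values for radiation, sensitivity and the interaction term
pvalues <- anova.table[1:3, "Pr(>F)"];
names(pvalues) <- rownames(anova.table)[1:3];
print(pvalues);
```

The interaction row is the one that addresses synergy or antagonism; the two main-effect rows describe each factor on its own.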

Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

So Far We Have Considered Exactly-Defined Groups Six cell-lines with differential sensitivity to a drug: 80%, 45%, 15%, 30%, 85%, 70% What genes are associated with this phenomenon?

Two Basic Approaches Correlation metrics: Correlations, Mutual Information Fit linear models with continuous variables: Y = a0 + a1x1 Y = mRNA abundance a0 = basal level a1 = effect of drug x1 = drug sensitivity
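The linear-model approach could be sketched for one gene as below. The sensitivities are the six values from the cell-line example; the expression values are made up for illustration:

```r
# Sketch: relate one gene's mRNA abundance to a continuous drug-sensitivity score.
sensitivity <- c(80, 45, 15, 30, 85, 70);          # percent, from the cell-line example
expression <- c(9.1, 7.8, 6.4, 7.1, 9.4, 8.6);     # made-up mRNA abundances

# Y = a0 + a1x1 with x1 = drug sensitivity
fit <- lm(expression ~ sensitivity);
coefficients.table <- summary(fit)$coefficients;

a0 <- coefficients.table["(Intercept)", "Estimate"];    # basal level
a1 <- coefficients.table["sensitivity", "Estimate"];    # change per unit of sensitivity
pvalue <- coefficients.table["sensitivity", "Pr(>|t|)"];
print(c(a0 = a0, a1 = a1, p = pvalue));
```

As with the ANOVA loop, this fit would be repeated per gene and the resulting p-values adjusted for multiple testing.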

Correlation Basics Start from the beginning, univariate statistics: Variance = Var(X) = E[(X - μX)^2] Standard Deviation = [Var(X)]^0.5 But if you have two variables, how are they related? Covariance = Cov(X,Y) = E[ (X - μX)(Y - μY) ] Correlation is a scaled form of the covariance

Basic Properties of Correlations Unit-less Variance and covariance have squared units Standard deviation has normal units Range [-1.0,1.0] Range is independent of sample-size Range is independent of the range of X and Y Captures the degree in which two variables change together

Relationship Types Correlation > 0 Variables positively correlated When one goes up, the other one tends to as well Correlation < 0 Variables negatively (inversely) correlated When one goes up, the other tends to go down Correlation = 0 No linear relationship NB: if variables are independent, then correlation = covariance = 0 NB: if correlation = covariance = 0, variables may still be dependent (zero correlation does not imply independence)

Pearson’s Correlation Most common correlation metric, R Measures linear relationship between two variables R = Cov(X, Y) / (σXσY)
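The definition can be checked against R's built-in cor(). A small sketch, using made-up values for x and y:

```r
# Sketch: Pearson's R computed from its definition and with cor().
x <- c(80, 45, 15, 30, 85, 70);
y <- c(9.1, 7.8, 6.4, 7.1, 9.4, 8.6);   # made-up values

# R = Cov(X, Y) / (sd(X) * sd(Y)); cov() and sd() both use the n - 1 denominator
r.manual <- cov(x, y) / (sd(x) * sd(y));
r.builtin <- cor(x, y, method = "pearson");

# rescaling a variable changes the covariance but not the correlation
cov.rescaled <- cov(x / 100, y);
r.rescaled <- cor(x / 100, y);

print(c(manual = r.manual, builtin = r.builtin, rescaled = r.rescaled));
```

The rescaling step illustrates why correlation is unit-less while covariance is not.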

Pearson’s R Cannot Capture Non-Linear Relationships Correctly

Spearman’s Rank-Order Correlation Second most-common correlation, ρ (Greek rho) Makes no assumptions about the relationships between variables Simplified version of Pearson’s R Works directly on ranks di = xi - yi (the differences between ranks) ρ = 1 - (6Σdi^2) / [n (n^2 - 1)]
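The rank-difference formula (with squared differences) can be verified on a small made-up example. Note that this shortcut form holds only when there are no tied ranks:

```r
# Sketch: Spearman's rho from rank differences and with cor().
x <- c(80, 45, 15, 30, 85, 70);
y <- c(8.6, 7.8, 6.4, 7.1, 9.4, 9.1);   # made-up values, no ties

d <- rank(x) - rank(y);                 # differences between ranks
n <- length(x);
rho.formula <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1));

# the built-in version is Pearson's correlation applied to the ranks
rho.builtin <- cor(x, y, method = "spearman");

print(c(formula = rho.formula, builtin = rho.builtin));
```

Here only two observations swap ranks, so the sum of squared differences is 2 and rho is 1 - 12/210.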

Spearman Example

Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

Survival Analysis A major new area in microarray analysis Works with any right-censored data Censoring: the value is only partially known Right-censoring: the value is at least this large Final outcome is not known: Patients are still alive at the time of the analysis An adverse drug-reaction has not happened yet Standard statistical approaches in use

Typical Survival Curve

Key Survival Statistics Cox proportional hazards model HR = hazard ratio P = p-value for the null hypothesis that the hazard ratio is 1.0 Log-rank test Tests whether two survival curves differ
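Both statistics are available in the survival package. The sketch below uses simulated follow-up data for one gene; the simulation settings and the median split used for the log-rank test are assumptions for illustration:

```r
library(survival);   # recommended package distributed with R

# Sketch on simulated data: 'time' is follow-up, 'status' is 1 = event, 0 = censored.
set.seed(3);
n <- 100;
expression <- rnorm(n);                                      # made-up expression of one gene
event.time <- rexp(n, rate = 0.1 * exp(0.8 * expression));   # higher expression, higher hazard
censor.time <- rexp(n, rate = 0.05);                         # independent censoring
time <- pmin(event.time, censor.time);
status <- as.numeric(event.time <= censor.time);

# Cox proportional hazards model with expression as a continuous covariate
cox.fit <- coxph(Surv(time, status) ~ expression);
cox.summary <- summary(cox.fit);
hazard.ratio <- cox.summary$coefficients[1, "exp(coef)"];
cox.p <- cox.summary$coefficients[1, "Pr(>|z|)"];

# log-rank test after splitting patients at the median expression
group <- factor(expression > median(expression), labels = c("low", "high"));
logrank <- survdiff(Surv(time, status) ~ group);
logrank.p <- pchisq(logrank$chisq, df = length(logrank$n) - 1, lower.tail = FALSE);

print(c(HR = hazard.ratio, cox.p = cox.p, logrank.p = logrank.p));
```

The Cox model uses the expression values directly, whereas the log-rank test needs the patients split into groups first.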

Example Beer and coworkers studied non-small cell lung cancer using an older Affymetrix microarray: 12 samples of normal lung 83 samples of non-small cell lung cancer ~10,000 genes on their array Two questions: How many genes are associated with tumour-initiation? How many genes are associated with tumour-progression?

Tumour Initiation: per-gene t-tests More genes repressed Fewer oncogenes?!

Tumour Progression: per gene Cox models More genes are involved in helping a tumour resist treatment and grow larger than in “making” it in the first place! P < 0.05 733 Genes 230 Genes P < 0.01 63 Genes 136 Genes P < 0.001 2 Genes 15 Genes Progression Initiation

Warning There are several assumptions to a Cox model: Non-parametric No assumptions made about “baseline hazard” Censoring must be independent of events You shouldn’t be more likely to lose follow-up information on patients who die Hazard must be proportional No changes across time In general you want to have a statistician around to ensure you are doing survival analyses correctly.
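The proportional-hazards assumption can be examined with cox.zph() from the survival package. This is a sketch on simulated data, not a substitute for review by a statistician:

```r
library(survival);   # recommended package distributed with R

# Sketch on simulated data in which hazards really are proportional.
set.seed(4);
n <- 100;
expression <- rnorm(n);                                      # made-up expression of one gene
event.time <- rexp(n, rate = 0.1 * exp(0.5 * expression));
censor.time <- rexp(n, rate = 0.05);                         # independent censoring
time <- pmin(event.time, censor.time);
status <- as.numeric(event.time <= censor.time);

cox.fit <- coxph(Surv(time, status) ~ expression);

# test whether the effect of expression changes over time
ph.test <- cox.zph(cox.fit);
print(ph.test$table);

# a small p-value would suggest the hazard is not proportional
ph.p <- ph.test$table["expression", "p"];
```

Plotting the result with plot(ph.test) shows how the estimated effect drifts over follow-up time, which is often more informative than the p-value alone.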

Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

Meta-Analysis Combining the results of multiple studies that address related hypotheses Often used to merge data from different microarray platforms Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data

Why Do Meta-Analysis? Can identify publication biases Appropriately weights diverse studies Sample-size Experimental-reliability Similarity of study-specific hypotheses to the overall one Increases statistical power Reduces information overload A single meta-analysis vs. five large studies Provides clearer guidance
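One common way to weight studies is inverse-variance weighting in a fixed-effect model. That technique is not named in the material above, so the sketch below is an assumption about what "appropriately weights" might look like, with made-up effect sizes for a single gene:

```r
# Sketch: fixed-effect, inverse-variance-weighted meta-analysis for one gene.
# Effect sizes (e.g. log2 fold-changes) and standard errors are made up.
effects <- c(0.80, 0.55, 1.10);
std.errors <- c(0.40, 0.20, 0.60);

# precise studies (small standard error) receive more weight
weights <- 1 / std.errors^2;
pooled.effect <- sum(weights * effects) / sum(weights);
pooled.se <- sqrt(1 / sum(weights));

# two-sided p-value for the pooled effect under a normal approximation
z <- pooled.effect / pooled.se;
pooled.p <- 2 * pnorm(abs(z), lower.tail = FALSE);

print(c(effect = pooled.effect, se = pooled.se, p = pooled.p));
```

The pooled standard error is smaller than that of any single study, which is where the gain in statistical power comes from.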

Challenges of Meta-Analysis No control for bias What happens if most studies are poorly designed? File-drawer problem Publication bias can be detected, but not explicitly controlled for How homogeneous is the data? Can it be fairly grouped? Simpson’s Paradox
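Simpson's Paradox can be illustrated with a toy example. The two "studies" below are made up so that each shows a perfect positive relationship, yet naive pooling reverses the sign:

```r
# Sketch: Simpson's Paradox with two made-up studies.
study.a <- data.frame(x = 1:4, y = 1:4 + 10);   # y rises with x
study.b <- data.frame(x = 6:9, y = 6:9 - 4);    # y rises with x, at a lower baseline

cor.a <- cor(study.a$x, study.a$y);
cor.b <- cor(study.b$x, study.b$y);

# pooling the raw data ignores the difference in baseline between studies
pooled <- rbind(study.a, study.b);
cor.pooled <- cor(pooled$x, pooled$y);

print(c(study.a = cor.a, study.b = cor.b, pooled = cor.pooled));
```

This is why the homogeneity of the studies needs to be assessed before their data are grouped.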
