Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Lecture 5: Multivariate Analyses II: General Models
MBP1010, Winter 2014
Dr. Paul C. Boutros
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.
Image: Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE).

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions

Topics For This Week
• Review to date
• Examples
• Assignment #1
• Attendance
• More on Multivariate Models

Review From Lecture #2
How can you interpret a QQ plot? It compares two samples, or a sample and a distribution. A straight line indicates identity.
What is hypothesis testing? Confirmatory data analysis: testing a null hypothesis.
What is a p-value? Evidence against the null: the probability of seeing a value at least as extreme by chance alone if the null hypothesis is true (loosely, the probability of a false positive).

Review From Lecture #2
Parametric vs. non-parametric tests? Parametric tests have distributional assumptions.
What is the t-statistic? A signal:noise ratio.
Assumptions of the t-test? Data sampled from a normal distribution; independence of replicates; independence of groups; homoscedasticity.

Flow-Chart For Two-Sample Tests
1. Is the data sampled from a normally-distributed population?
   • Yes: go to step 3
   • No: go to step 2
2. Sufficient n for CLT (>30)?
   • Yes: go to step 3
   • No: Wilcoxon U-Test
3. Equal variance (F-Test)?
   • Yes: Homoscedastic T-Test
   • No: Heteroscedastic T-Test
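The flow-chart can be sketched as a small R helper. This is only an illustration under stated assumptions: the function name `two_sample_test`, the use of the Shapiro-Wilk test (`shapiro.test`) as the normality check, and the 0.05 cut-off are not from the slides, which do not say how normality should be assessed. The input values are simulated.

```r
# Hypothetical helper that walks the two-sample flow-chart.
two_sample_test <- function(x, y, alpha = 0.05, clt_n = 30) {
  # Assumption: Shapiro-Wilk stands in for "is the data normal?"
  looks_normal <- shapiro.test(x)$p.value > alpha &&
    shapiro.test(y)$p.value > alpha
  enough_n <- min(length(x), length(y)) > clt_n

  if (!looks_normal && !enough_n) {
    # Non-normal and too few samples for the CLT: rank-based test
    return(wilcox.test(x, y, exact = FALSE))
  }

  # The F-test for equal variances picks the t-test flavour
  equal_var <- var.test(x, y)$p.value > alpha
  t.test(x, y, var.equal = equal_var)
}

# Simulated measurements, for illustration only
set.seed(42)
a <- rnorm(12, mean = 5, sd = 1)
b <- rnorm(12, mean = 6, sd = 1)
result <- two_sample_test(a, b)
print(result$method)
```

Printing `result$method` shows which branch of the chart was taken for a given data set.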

Review From Lecture #3
What is statistical power? The probability that a test will correctly reject a false null hypothesis. AKA sensitivity, or 1 - false-negative rate.
What is a correlation? A relationship between two (random) variables.
Common correlation metrics? Pearson, Spearman, Kendall.
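As a quick illustration of how the three correlation metrics differ, the sketch below uses a made-up monotonic but non-linear relationship. The rank-based metrics (Spearman, Kendall) see a perfect association, while Pearson, which measures linear association, does not.

```r
# Toy data: y increases with x, but not linearly
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")   # linear association
spearman <- cor(x, y, method = "spearman")  # rank-based
kendall  <- cor(x, y, method = "kendall")   # rank-based (concordant pairs)

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```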

Lecture #3 Review
• Hypergeometric test
  • Is a sample randomly selected from a fixed population?
• Proportion test
  • Are two proportions equivalent?
• Fisher’s Exact test
  • Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
  • Are paired observations on two variables independent?

Example #1
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour.
Your data:
OG (cm³): 5.2 1.9 5.0 6.1 4.5 4.8
TS (cm³): 3.9 7.1 3.1 4.4 5.0
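One plausible way to approach Example #1 in R, offered as a sketch rather than the official answer: with only 5 to 6 animals per group, normality is hard to justify and n is far below the CLT threshold, so the flow-chart points to a rank-based test. The one-sided direction (`alternative = "less"`) encodes the hypothesis that TS tumours are smaller; choosing Wilcoxon here is an assumption.

```r
# Tumour volumes from the slide (cm^3)
og <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8)
ts <- c(3.9, 7.1, 3.1, 4.4, 5.0)

# One-sided test: are TS volumes shifted below OG volumes?
# exact = FALSE because the value 5.0 occurs in both groups (a tie),
# so an exact p-value cannot be computed.
res <- wilcox.test(ts, og, alternative = "less", exact = FALSE)

print(res$statistic)  # W = 12.5 (out of 5 * 6 = 30 possible pairs)
print(res$p.value)
```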

Example #2
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: mice lacking TS will acquire tumours sooner than mice with amplification of OG. You test the mice weekly using ultrasound imaging.
Your data:
TS (week of tumour): 4 2 5 4 4
OG (week of tumour): 3 6 3 2 4 3

Example #3
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than those with amplification of OG, as assessed by a trained pathologist.
Your data:
TS (pathological response): Yes No Yes Yes No
OG (pathological response): Yes Yes Yes Yes No Yes
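Example #3 compares two binary classifications (line and response), which matches Fisher's exact test from the Lecture #3 review. The sketch below is one plausible analysis, not the official answer. The counts are tallied from the slide (TS: 3 Yes, 2 No; OG: 5 Yes, 1 No); the one-sided alternative is an assumption that encodes "TS is less likely to respond".

```r
# 2 x 2 table of counts tallied from the slide
response <- matrix(
  c(3, 2,   # TS: Yes, No
    5, 1),  # OG: Yes, No
  nrow = 2, byrow = TRUE,
  dimnames = list(line = c("TS", "OG"), response = c("Yes", "No"))
)

# alternative = "less": odds of response in TS are lower than in OG
res <- fisher.test(response, alternative = "less")
print(res$p.value)  # 196/462, about 0.42
```

The p-value is the hypergeometric probability of seeing 3 or fewer responders among the 5 TS mice, given 8 responders among 11 mice overall.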

Example #4
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG).
Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:
OG (DX-responsive genes): MYC KRAS CD53 CDH1 MUC1 MARCH1 PTEN IDH3 ESR2 RHEB CTCF STK11 MLL3 KEAP1 NFE2L2 ARID1A
TS (DX-responsive genes): MYC KRAS CD53 CDH1 FBW1 SEPT7 MUC1 MUC3 MUC9 RNF3
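For Example #4, one plausible approach (a sketch, not the official answer) is the hypergeometric test from the Lecture #3 review: if the 10 TS genes were drawn at random from the 20,000 genes on the array, how likely is an overlap with the 16 OG genes at least as large as the one observed? Treating the 20,000 genes as the background population is an assumption taken from the slide's wording.

```r
og_genes <- c("MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3",
              "ESR2", "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A")
ts_genes <- c("MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7", "MUC1", "MUC3",
              "MUC9", "RNF3")
n_genes <- 20000  # genes on the array (background population)

shared <- intersect(og_genes, ts_genes)
print(shared)  # MYC KRAS CD53 CDH1 MUC1

# P(overlap >= observed): phyper() gives P(X <= q), so use q = observed - 1
# together with lower.tail = FALSE.
p_overlap <- phyper(
  length(shared) - 1,
  m = length(og_genes),            # "marked" genes in the population
  n = n_genes - length(og_genes),  # unmarked genes
  k = length(ts_genes),            # genes drawn
  lower.tail = FALSE
)
print(p_overlap)
```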

Review From Lecture #4
Assumptions of linear-modeling:
• One variable is a response and one a predictor
• No adjustment is needed for confounding or other between-subject variation
• Linearity
• σ² is constant, independent of x
• Predictors are independent of each other
• For proper statistical inference (CI, p-values), errors are normally distributed

Review From Lecture #4
How do we assess the adequacy of a model? By considering the size of the residuals (R²).
How can we test the quality of a model? Residual plots; QQ plots; prediction accuracy.
Compare a one-way ANOVA to a logistic regression: a linear model where x is factorial vs. one where y is factorial.

Lots of Analyses Are Linear Regressions
• Linear Regression: Y = a0 + a1x1, x1 continuous
• Logistic Regression: Y = a0 + a1x1, Y factorial
• 1-way ANOVA: Y = a0 + a1x1, x1 factorial
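The claim that a 1-way ANOVA is a linear regression with a factorial predictor can be checked directly in R. The sketch below uses simulated data (the group labels and effect sizes are made up) and compares the F-statistic from `lm()` with the one from `aov()`.

```r
# Simulated data: three groups of five observations each
set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 5))
group_means <- c(A = 0, B = 0.5, C = 1)
y <- rnorm(15, mean = group_means[as.character(group)])

fit_lm  <- lm(y ~ group)   # linear model with a factorial predictor
fit_aov <- aov(y ~ group)  # classic 1-way ANOVA

f_lm  <- summary(fit_lm)$fstatistic[["value"]]
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

print(c(lm = f_lm, aov = f_aov))  # the two F-statistics agree
```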

Now Let’s Go Over Assignment #1
• Tip #1: avoid reserved words (e.g. data)
• Tip #2: take advantage of file-handling arguments (shorter code, better readability)
• Tip #3: consistent indentation

Attendance Break

When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very) recent Nature paper (Advance Online Publication)

Cervix Cancer 101
• Disease burden increasing (~380k to ~450k in the last 30 years)
• By age 50, >80% of women have HPV infection
• >75% of sexually active women exposed, only a subset affected
• Why is almost entirely unknown!
• Tightly associated with poverty

HPV Infection Is Associated With Multiple Cancers
• Cervix >99%
• Anal ~85%
• Vaginal ~70%
• Vulvar ~40%
• Penile ~45%
• Head & Neck ~20-30%
Of course not all of these are the HPV subtypes caught by current vaccines, but a majority are. Thus many cancers are preventable.

Figure 1 Is a Classic Sequencing Figure
Mutation rate vs. histology

But Histology Is Associated With Age

Age Is Associated With Mutation Rate
R² = 0.08; p = 0.005 → Is this meaningful?
Mutation rates shown: 4.2/Mbp vs. 1.6/Mbp; P(Wilcoxon) = 0.0095

Perhaps Not in Isolation But...

The Solution: Linear Regression
Mutation Rate = a0 + a1x1 + a2x2
x1 = histology indicator (adeno = 1; squam = 0)
x2 = age in years (continuous)
Mutation Rate = 0.259 - 0.145x1 + 0.006x2
P(a1 ≠ 0) = 0.045
P(a2 ≠ 0) = 0.012
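To make the fitted model concrete, the sketch below plugs the slide's coefficients into the equation for two hypothetical 50-year-old patients. The function name `predicted_rate` and the example age are illustrative assumptions, and the units are whatever the slide's fit used (not stated here). With the underlying data in hand, the fit itself would presumably be a call of the form `lm(rate ~ histology + age)`, but that data is not part of these slides.

```r
# Coefficients copied from the slide
a0 <- 0.259   # intercept
a1 <- -0.145  # histology effect (adeno = 1; squam = 0)
a2 <- 0.006   # age effect, per year

# Hypothetical helper: evaluates Mutation Rate = a0 + a1*x1 + a2*x2
predicted_rate <- function(adeno, age) {
  a0 + a1 * adeno + a2 * age
}

squam_50 <- predicted_rate(adeno = 0, age = 50)  # 0.259 + 0.006 * 50 = 0.559
adeno_50 <- predicted_rate(adeno = 1, age = 50)  # 0.559 - 0.145     = 0.414

print(c(squam = squam_50, adeno = adeno_50))
```

At a fixed age, the two histologies differ by exactly a1; at a fixed histology, each extra year of age adds a2.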

General Linear Modeling
The underlying mathematical framework for most statistical techniques we are familiar with:
• ANOVAs
• Logistic regression
• Linear regression
• Multiple regression
Y = a0 + a1x1 + a2x2 + …
NOT the same as a “Generalized Linear Model”!!!

General Linear Modeling: Special Cases
• Linear Regression: Y = a0 + a1x1, x1 continuous
• Logistic Regression: Y = a0 + a1x1, Y factorial
• Multiple Regression: Y = a0 + a1x1 + a2x2, x1, x2 continuous

ANOVAs
• 1-way ANOVA: Y = a0 + a1x1, x1 factorial
• 2-way ANOVA: Y = a0 + a1x1 + a2x2 + a3x1x2, x1, x2 two-level factors

ANOVA Experimental Designs Are Common
Classic one-way ANOVAs:
• Treat a cell-line with 5 drugs: do any of them make a difference?
• Make 5 different genetic mutations: do any of them alter gene-expression?
H0: all group means are equal (H1: the mean of at least one group differs)
Guesses at the assumptions?

Assumptions Are Similar to T-test
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
  • Not correlated
  • Random normal variables

But This Is Limited
A 1-way ANOVA just says that at least one group differs
Which one? → post hoc tests
It is often hard to know which post hoc test to use, so it is often worth consulting a statistician here
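As an illustration of a post hoc test, the sketch below follows a 1-way ANOVA with Tukey's honest significant difference (`TukeyHSD`). Tukey's method is just one common choice, picked here as an assumption; the slides do not recommend a specific post hoc test. The data are simulated, with only one drug given a real effect.

```r
# Simulated viability measurements: drugB has a true effect, drugA does not
set.seed(7)
drug <- factor(rep(c("vehicle", "drugA", "drugB"), each = 6))
effect <- c(vehicle = 0, drugA = 0, drugB = 2)
viability <- rnorm(18, mean = 10 + effect[as.character(drug)], sd = 1)

fit <- aov(viability ~ drug)  # does any group differ?
posthoc <- TukeyHSD(fit)      # which pairs of groups differ?

# One row per pairwise comparison: difference, confidence bounds, adjusted p
print(posthoc$drug)
```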

Sometimes 1-Way ANOVAs Are Not Worth the Effort
Groups: Wildtype, Mutation 1, Mutation 2
1-way ANOVA + post hoc, or 2 t-tests?

Not Always Testing Raw Data
3 drugs with different controls: Vehicle 1 / Drug 1, Vehicle 2 / Drug 2, Vehicle 3 / Drug 3
1-way ANOVA on the fold-changes

Two-Way ANOVAs
Probably even more common than one-way ANOVAs
Very powerful: Synergy? Additivity? Antagonism?
Y = a0 + a1x1 + a2x2 + a3x1x2
Assumptions?

Assumptions Are Similar to 1-Way ANOVA
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
  • Not correlated
  • Random normal variables

Do These Treatments Interact?
Standard approach: ANOVA with an interaction term (Treatment #1 × Treatment #2)

Example: Radiation Toxicity
Some people are prone to late-stage radio-toxicity
Does radiation induce specific patterns of gene-expression in these people?
Design: radiation dose (0 Gy vs. 3 Gy) × radio-sensitive status

Two-Way ANOVAs in R
Standard model-fitting uses the lm() function
For microarray and -omic analyses, the limma package is one very good approach for this (covered over the next few weeks)
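A minimal sketch of a two-way ANOVA with `lm()`, loosely modelled on the radiation example. Everything about the data is an assumption for illustration: the expression values are simulated, and the column names, replicate count, and effect sizes are made up. The formula `radiation * sensitive` expands to both main effects plus their interaction, which corresponds to the a3x1x2 term.

```r
# Simulated 2 x 2 design with 4 replicates per cell (illustrative only)
set.seed(3)
design <- expand.grid(
  replicate = 1:4,
  radiation = c("0Gy", "3Gy"),
  sensitive = c("no", "yes")
)

is_irradiated <- design$radiation == "3Gy"
is_sensitive  <- design$sensitive == "yes"
design$expression <- 8 +
  1.0 * is_irradiated +                   # a1: radiation main effect
  0.5 * is_sensitive +                    # a2: sensitivity main effect
  1.5 * (is_irradiated & is_sensitive) +  # a3: interaction
  rnorm(nrow(design), sd = 0.5)

# Y = a0 + a1*x1 + a2*x2 + a3*x1*x2
fit <- lm(expression ~ radiation * sensitive, data = design)

print(coef(fit))   # a0, a1, a2, a3
print(anova(fit))  # one row per term, plus residuals
```

The `radiation:sensitive` row of the ANOVA table tests whether the radiation effect differs between sensitive and non-sensitive samples.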

