
Canadian Bioinformatics Workshops

Lecture 5: Multivariate Analyses II: General Models. MBP1010, Dr. Paul C. Boutros, Winter 2014.


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Lecture 5: Multivariate Analyses II: General Models. MBP1010, Dr. Paul C. Boutros, Winter 2014. Department of Medical Biophysics. † Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE). This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.

  4. Course Overview • Lecture 1: What is Statistics? Introduction to R • Lecture 2: Univariate Analyses I: continuous • Lecture 3: Univariate Analyses II: discrete • Lecture 4: Multivariate Analyses I: specialized models • Lecture 5: Multivariate Analyses II: general models • Lecture 6: Sequence Analysis • Lecture 7: Microarray Analysis I: Pre-Processing • Lecture 8: Microarray Analysis II: Multiple-Testing • Lecture 9: Machine-Learning • Final Exam (written)

  5. House Rules • Cell phones to silent • No side conversations • Hands up for questions

  6. Topics For This Week • Review to date • Examples • Assignment #1 • Attendance • More on Multivariate Models

  7. Review From Lecture #2 How can you interpret a QQ plot? It compares two samples, or a sample and a distribution; a straight line indicates identity. What is hypothesis testing? Confirmatory data analysis; testing a null hypothesis. What is a p-value? Evidence against the null; probability of a false positive (FP); probability of seeing as extreme a value by chance alone.

  8. Review From Lecture #2 Parametric vs. non-parametric tests? Parametric tests have distributional assumptions. What is the t-statistic? A signal:noise ratio. Assumptions of the t-test? Data sampled from a normal distribution; independence of replicates; independence of groups; homoscedasticity.

  9. Flow-Chart For Two-Sample Tests Is the data sampled from a normally-distributed population? Yes: go to the equal-variance check. No: sufficient n for the CLT (>30)? If yes, go to the equal-variance check; if no, use the Wilcoxon U-test. Equal variance (F-test)? Yes: homoscedastic t-test. No: heteroscedastic t-test.
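
The decision logic of this flow chart can be sketched as a small function. This is only one reading of the chart, assuming that the "sufficient n for CLT" branch rejoins the equal-variance check; the function name and its arguments are hypothetical, and the three yes/no inputs are assumed to have been decided elsewhere (for example by a QQ plot and an F-test).

```python
def choose_two_sample_test(normal: bool, n_per_group: int, equal_variance: bool) -> str:
    """Return the test suggested by the flow chart (as read from the slide)."""
    # Non-normal data with too few samples for the CLT (n <= 30): use a rank test.
    if not normal and n_per_group <= 30:
        return "Wilcoxon U-test"
    # Otherwise a t-test is used; the equal-variance check picks the variant.
    if equal_variance:
        return "homoscedastic t-test"
    return "heteroscedastic t-test"


# Hypothetical inputs: small, non-normal samples.
print(choose_two_sample_test(normal=False, n_per_group=6, equal_variance=True))
```

The thresholds are taken directly from the slide (n > 30 for the CLT); in practice these cut-offs are rules of thumb rather than hard limits.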

  10. Review From Lecture #3 What is statistical power? The probability that a test will correctly reject a false null hypothesis; AKA sensitivity, or 1 - false-negative rate. What is a correlation? A relationship between two (random) variables. Common correlation metrics? Pearson, Spearman, Kendall.

  11. Lecture #3 Review • Hypergeometric test • Is a sample randomly selected from a fixed population? • Proportion test • Are two proportions equivalent? • Fisher’s Exact test • Are two binary classifications associated? • (Pearson’s) Chi-Squared Test • Are paired observations on two variables independent?

  12. Example #1 You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour. Your data: OG (cm³): 5.2, 1.9, 5.0, 6.1, 4.5, 4.8. TS (cm³): 3.9, 7.1, 3.1, 4.4, 5.0.
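
One possible way to start on Example #1, sketched under assumptions: the slide does not say which test is intended, so this assumes the volumes can be treated as normally distributed and uses the heteroscedastic (Welch) form of the t-statistic. The helper name `welch_t` is hypothetical. Only the statistic and its approximate degrees of freedom are computed; a p-value would need a t-distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

# Tumour volumes (cm^3) copied from the slide.
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]
ts = [3.9, 7.1, 3.1, 4.4, 5.0]


def welch_t(a, b):
    """Welch t-statistic (signal / noise) and Welch-Satterthwaite df."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_stat, df = welch_t(ts, og)
print(f"mean TS = {mean(ts):.2f}, mean OG = {mean(og):.2f}")
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The sign convention here is TS minus OG, so a negative t would point in the direction of the stated hypothesis (smaller tumours in TS mice).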

  13. Example #2 You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS will acquire tumours sooner than mice with amplification of OG. You test the mice weekly using ultrasound imaging. Your data: TS (week of tumour): 4, 2, 5, 4, 4. OG (week of tumour): 3, 6, 3, 2, 4, 3.
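
A rough sketch for Example #2, assuming a rank-based comparison (the Wilcoxon / Mann-Whitney U statistic) is appropriate because the weekly readings are discrete and tied. The slide itself does not name the test. The helper `u_statistic` is a hypothetical name, and the p-value is an exact one-sided permutation p-value obtained by relabelling the 11 mice in every possible way, which is an illustrative choice rather than the course's prescribed method.

```python
from itertools import combinations

# Week of first detected tumour, copied from the slide.
ts_weeks = [4, 2, 5, 4, 4]
og_weeks = [3, 6, 3, 2, 4, 3]


def u_statistic(a, b):
    """Number of (a, b) pairs with a > b; ties count as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


u_ts = u_statistic(ts_weeks, og_weeks)

# Permutation test: a small U for TS means TS tumours tend to appear earlier,
# so count relabellings whose U is at least as small as the observed one.
pooled = ts_weeks + og_weeks
count = total = 0
for idx in combinations(range(len(pooled)), len(ts_weeks)):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    if u_statistic(a, b) <= u_ts:
        count += 1

p_one_sided = count / total
print(f"U(TS) = {u_ts}, one-sided permutation p = {p_one_sided:.3f}")
```

With 5 and 6 mice there are only 462 possible relabellings, so full enumeration is cheap; larger samples would normally use a library implementation instead.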

  14. Example #3 You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than those with amplification of OG as assessed by a trained pathologist. Your data: TS (pathological response): Yes, No, Yes, Yes, No. OG (pathological response): Yes, Yes, Yes, Yes, No, Yes.
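
For Example #3 the data reduce to a 2x2 table of line by response, so one natural candidate is Fisher's exact test from Lecture #3. The sketch below assumes a one-sided alternative (TS responds less often) and computes the p-value directly from the hypergeometric distribution; the helper name `hypergeom_pmf` is hypothetical, and the choice of test is an assumption, not something the slide states.

```python
from math import comb

# Counts tallied from the slide: TS = Yes, No, Yes, Yes, No; OG = Yes x5, No x1.
ts_yes, ts_no = 3, 2
og_yes, og_no = 5, 1

n_total = ts_yes + ts_no + og_yes + og_no  # 11 mice
n_responders = ts_yes + og_yes             # 8 responders overall
n_ts = ts_yes + ts_no                      # 5 TS mice


def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes) when drawing without replacement."""
    return comb(successes, k) * comb(population - successes, draws - k) / comb(population, draws)


# One-sided Fisher p-value: probability of seeing this few (or fewer)
# responders among the TS mice if response were unrelated to the line.
p_value = sum(
    hypergeom_pmf(k, n_total, n_responders, n_ts) for k in range(ts_yes + 1)
)
print(f"one-sided Fisher exact p = {p_value:.4f}")
```

`math.comb` returns 0 when asked to choose more items than are available, so impossible tables contribute nothing to the sum.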

  15. Example #4 You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated. OG (DX-responsive genes): MYC, KRAS, CD53, CDH1, MUC1, MARCH1, PTEN, IDH3, ESR2, RHEB, CTCF, STK11, MLL3, KEAP1, NFE2L2, ARID1A. TS (DX-responsive genes): MYC, KRAS, CD53, CDH1, FBW1, SEPT7, MUC1, MUC3, MUC9, RNF3.
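
Example #4 can be framed as a gene-list overlap question, for which the hypergeometric test from Lecture #3 is one candidate. The sketch below assumes both lists were drawn from the same universe of 20,000 genes and asks how likely an overlap at least this large would be by chance. That framing is an assumption on my part, not an answer given on the slide.

```python
from math import comb

# Gene lists copied from the slide.
og_genes = ("MYC KRAS CD53 CDH1 MUC1 MARCH1 PTEN IDH3 ESR2 RHEB "
            "CTCF STK11 MLL3 KEAP1 NFE2L2 ARID1A").split()
ts_genes = "MYC KRAS CD53 CDH1 FBW1 SEPT7 MUC1 MUC3 MUC9 RNF3".split()

overlap = sorted(set(og_genes) & set(ts_genes))

universe = 20000        # genes on the array (from the slide)
n_og = len(og_genes)    # "successes" in the universe
n_ts = len(ts_genes)    # number of draws
k = len(overlap)        # observed shared genes

# P(overlap >= k) under random, independent selection of the two lists.
numerator = sum(
    comb(n_og, i) * comb(universe - n_og, n_ts - i)
    for i in range(k, min(n_og, n_ts) + 1)
)
p_value = numerator / comb(universe, n_ts)

print(f"shared genes: {overlap}")
print(f"P(overlap >= {k}) = {p_value:.3g}")
```

Python integers are arbitrary precision, so the large binomial coefficients are computed exactly and only the final ratio is converted to a float.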

  16. Review From Lecture #4 Assumptions of linear modeling? One variable is a response and one a predictor; no adjustment is needed for confounding or other between-subject variation; linearity; σ² is constant, independent of x; predictors are independent of each other; for proper statistical inference (CI, p-values), errors are normally distributed.

  17. Review From Lecture #4 How do we assess the adequacy of a model? By considering the size of the residuals (R²). How can we test the quality of a model? Residual plots; QQ plots; prediction accuracy. Compare a one-way ANOVA to a logistic regression: a linear model where x is factorial vs. one where y is factorial.

  18. Lots of Analyses Are Linear Regressions Linear Regression: Y = a0 + a1x1, x1 continuous. Logistic Regression: Y = a0 + a1x1, Y factorial. 1-way ANOVA: Y = a0 + a1x1, x1 factorial.

  19. Now Let’s Go Over Assignment #1 Tip #1: avoid reserved words (e.g. data). Tip #2: take advantage of file-handling arguments (shorter code, better readability). Tip #3: consistent indentation.

  20. Attendance Break

  21. When Do We Use Statistics? • Ubiquitous in modern biology • Every class I will show a use of statistics in a (very, very) recent Nature paper. Advance Online Publication

  22. Cervix Cancer 101 • Disease burden increasing • (~380k to ~450k in the last 30 years) • By age 50, >80% of women have HPV infection • >75% of sexually active women exposed, only a subset affected • Why is almost totally unknown! • Tightly associated with poverty

  23. HPV Infection Is Associated With Multiple Cancers • Cervix >99% • Anal ~85% • Vaginal ~70% • Vulvar ~40% • Penile ~45% • Head & Neck ~20-30% Of course not all of these are the HPV subtypes caught by current vaccines, but a majority are. Thus many cancers are preventable.

  24. Figure 1 is a Classic Sequencing Figure Mutation rate vs. histology

  25. But Histology Is Associated With Age

  26. Age Is Associated With Mutation Rate R² = 0.08; p = 0.005. Is this meaningful? 4.2/Mbp vs. 1.6/Mbp; P(Wilcoxon) = 0.0095

  27. Perhaps Not in Isolation But...

  28. The Solution: Linear Regression Mutation Rate = a0 + a1x1 + a2x2, where x1 = histology indicator (adeno = 1; squam = 0) and x2 = age in years (continuous). Fitted model: Mutation Rate = 0.259 - 0.145x1 + 0.006x2; P(a1 ≠ 0) = 0.045; P(a2 ≠ 0) = 0.012
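
To make the fitted model on this slide concrete, here is a minimal sketch that plugs values into it. The coefficients are copied from the slide; the function name and the example ages are hypothetical, chosen only for illustration.

```python
def predicted_mutation_rate(is_adeno: bool, age_years: float) -> float:
    """Evaluate the fitted model: 0.259 - 0.145*x1 + 0.006*x2."""
    a0, a1, a2 = 0.259, -0.145, 0.006  # coefficients from the slide
    x1 = 1 if is_adeno else 0          # histology indicator (adeno = 1; squam = 0)
    return a0 + a1 * x1 + a2 * age_years


# Hypothetical patients: same age, different histology.
for histology, flag in (("adeno", True), ("squam", False)):
    print(histology, round(predicted_mutation_rate(flag, 50), 3))
```

Holding age fixed, the two histologies differ by exactly a1, which is what "adjusting for age" means in this model: the histology effect is estimated after the age trend has been accounted for.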

  29. General Linear Modeling The underlying mathematical framework for most statistical techniques we are familiar with: • ANOVAs • Logistic regression • Linear regression • Multiple regression Y = a0 + a1x1 + a2x2 + … NOT the same as a “Generalized Linear Model”!!!

  30. General Linear Modeling: Special Cases Linear Regression: Y = a0 + a1x1, x1 continuous. Logistic Regression: Y = a0 + a1x1, Y factorial. Multiple Regression: Y = a0 + a1x1 + a2x2, x1 and x2 continuous.

  31. ANOVAs 1-way ANOVA: Y = a0 + a1x1, x1 factorial. 2-way ANOVA: Y = a0 + a1x1 + a2x2 + a3x1x2, x1 and x2 two-level factors.

  32. ANOVA Experimental Designs Are Common Classic one-way ANOVAs: Treat a cell-line with 5 drugs: do any of them make a difference? Make 5 different genetic mutations: do any of them alter gene-expression? H0: all group means are equal (HA: the mean of at least one group differs). Guesses at the assumptions?

  33. Assumptions Are Similar to T-test Normal distribution for the dependent variable; samples are independent; homoscedasticity. Independent variables are: not correlated; random normal variables.
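
As a reminder of what a one-way ANOVA actually computes, the F-statistic is again a signal:noise ratio: variation between group means divided by variation within groups. The sketch below works this out by hand. The three groups are made-up numbers for illustration only, and the function name `one_way_f` is hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Signal: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: spread of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical measurements for three treatment groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F says only that the group means are unlikely to all be equal; it does not say which group differs.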

  34. But This Is Limited A 1-way ANOVA just says that at least one group differs. Which one? Use post hoc tests. It is often hard to know which post hoc test to use, so it is often worth consulting a statistician here.

  35. Sometimes 1-Way ANOVAs Are Not Worth the Effort Groups: Wildtype, Mutation 1, Mutation 2. 1-way ANOVA + post hoc, or 2 t-tests?

  36. Not Always Testing Raw Data 3 drugs with different controls (Vehicle 1 / Drug 1, Vehicle 2 / Drug 2, Vehicle 3 / Drug 3): 1-way ANOVA on the fold-changes

  37. Two-Way ANOVAs Probably even more common than one-way ANOVAs Very powerful: Synergy? Additivity? Antagonism? Y = a0 + a1x1 + a2x2 + a3x1x2 Assumptions?
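
For two two-level factors coded 0/1, the coefficients of Y = a0 + a1x1 + a2x2 + a3x1x2 can be read straight off the four cell means, because the model has as many parameters as cells. The sketch below illustrates this with made-up replicate values (all numbers are hypothetical); the interaction term a3 is the part that speaks to synergy, additivity or antagonism.

```python
from statistics import mean

# Hypothetical responses, keyed by (x1, x2) = (treatment 1 given?, treatment 2 given?).
cells = {
    (0, 0): [1.0, 1.2, 0.8],
    (1, 0): [2.0, 2.1, 1.9],
    (0, 1): [1.5, 1.4, 1.6],
    (1, 1): [4.0, 4.2, 3.8],
}
m = {key: mean(values) for key, values in cells.items()}

a0 = m[0, 0]                                   # baseline (neither treatment)
a1 = m[1, 0] - m[0, 0]                         # effect of treatment 1 alone
a2 = m[0, 1] - m[0, 0]                         # effect of treatment 2 alone
a3 = m[1, 1] - m[1, 0] - m[0, 1] + m[0, 0]     # departure from additivity

print(f"a0={a0:.2f} a1={a1:.2f} a2={a2:.2f} a3={a3:.2f}")
```

If a larger Y means a stronger effect, a3 near zero is consistent with additivity, a3 above zero with synergy, and a3 below zero with antagonism. Whether a3 differs from zero by more than noise is what the ANOVA interaction test assesses.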

  38. Assumptions Are Similar to 1-Way ANOVA Normal distribution for the dependent variable; samples are independent; homoscedasticity. Independent variables are: not correlated; random normal variables.

  39. Do these treatments interact? Standard approach: ANOVA with an interaction term (Treatment #1 × Treatment #2)

  40. Example: Radiation Toxicity Some people are prone to late-stage radio-toxicity. Does radiation induce specific patterns of gene-expression in these people? Design: Radiation (0 Gy or 3 Gy) by Radio-Sensitive status

  41. Two-Way ANOVAs in R Standard model-fitting uses the lm() function. For microarray and -omic analyses, the limma package is one very good approach for this (covered over the next few weeks).

  42. Course Overview • Lecture 1: What is Statistics? Introduction to R • Lecture 2: Univariate Analyses I: continuous • Lecture 3: Univariate Analyses II: discrete • Lecture 4: Multivariate Analyses I: specialized models • Lecture 5: Multivariate Analyses II: general models • Lecture 6: Sequence Analysis • Lecture 7: Microarray Analysis I: Pre-Processing • Lecture 8: Microarray Analysis II: Multiple-Testing • Lecture 9: Machine-Learning • Final Exam (written)
