
Canadian Bioinformatics Workshops

Learn how to organize and store microarray data, including raw and normalized data, and perform statistical analysis using techniques such as ANOVA and machine learning.


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 3: Statistical Analysis. Paul Boutros. Microarray Data Analysis, June 24-25, 2013

  4. Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine Learning

  5. What Are The Outputs of A Microarray Study? Primary Data: raw image (.DAT file), quantitation (.CEL file). Secondary Data: normalized data (usually an ASCII text file), QA/QC plots. Tertiary Data: statistical analyses, global visualization (e.g. heatmaps), downstream analyses (e.g. pathway, dataset-integration). These files can be 10s of GB for a typical Affy study

  6. How Do You Organize These Data? I recommend you put things on a fast, backed-up network drive /data/ Organize data by project /data/Project Create separate directories for each analysis /data/Project/raw /data/Project/QAQC /data/Project/pre-processing /data/Project/statistical /data/Project/pathway
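
The directory layout on this slide can be created from R in one pass. This is a minimal sketch, not part of the original slides: the root is placed under `tempdir()` purely so the example runs anywhere, whereas the slide recommends a backed-up network drive such as /data/.

```r
# Hypothetical project root; on a real system this would be /data/Project
# on a fast, backed-up network drive (tempdir() is an assumption for the demo)
project.root <- file.path(tempdir(), "data", "Project")

# one sub-directory per analysis, as listed on the slide
analysis.dirs <- c("raw", "QAQC", "pre-processing", "statistical", "pathway")

for (analysis in analysis.dirs) {
    # recursive = TRUE also creates the parent directories if needed
    dir.create(
        file.path(project.root, analysis),
        recursive = TRUE,
        showWarnings = FALSE
        )
    }

# list what was created directly below the project root
created <- list.dirs(project.root, full.names = FALSE, recursive = FALSE)
print(sort(created))
```

The same vector of analysis names could be reused to build the mirrored /scripts/Project structure described on the next slide.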

  7. How Do You Organize The Scripts? I recommend you write a separate script for each analysis, and put those in a standardized (backed-up!) location, mirroring the directory structure and naming of your dataset directories. Some sub-structure here is often useful: /scripts/Project/pre-processing.R /scripts/Project/statistical-univariate.R /scripts/Project/statistical-multivariate.R /scripts/Project/pathway/GOMiner.R /scripts/Project/pathway/Reactome.R /scripts/Project/integration/mRNA+CNV.R /scripts/Project/integration/public-data.R

  8. Why Many Small Scripts? Monolithic scripts are hard to maintain Easier to make errors Accidentally re-using the same variable name Harder to debug Harder for somebody else to learn Small scripts are more flexible Quicker to modify/re-run a small part of your analysis Easier to re-use the same code on another dataset This is akin to the “unix” mindset of systems design

  9. What To Save? Everything!! All QA/QC plots (common reviewer request) All pre-processed data (needed for GEO uploads) Gene-wise statistical analyses Not just the statistically-significant genes Collapse all analyses into one file, though All plots/etc Using clear filenames is critical Disk-space is not usually a critical concern here Your raw data will be much larger than your output!

  10. Most Important Points Do not delete things: Keep all old versions of your scripts by including the date in the filename (or using source-control) Version output files by date I have needed to go back to analyses done 7 years prior! Make regular (weekly) backups: Try to pass this work off to professional sysadmins External hard-drives/USBs are okay if you cannot get access to network drives, but try to automate
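
One way to version output files by date is a small helper that builds the filename. The helper name `dated.filename` and the naming pattern are assumptions made for illustration; the slide only asks that the date be part of the filename.

```r
# Build a filename of the form <date>_<project>_<analysis>.<extension>.
# An ISO date (YYYY-MM-DD) at the front makes files sort chronologically.
dated.filename <- function(project, analysis, extension = "txt", date = Sys.Date()) {
    paste0(format(date, "%Y-%m-%d"), "_", project, "_", analysis, ".", extension)
    }

# fixed date used here so the result is reproducible
example.name <- dated.filename(
    project = "Project",
    analysis = "statistical-univariate",
    date = as.Date("2013-06-24")
    )
print(example.name)
```

Leaving `date` at its default stamps the file with the day the script is run, so re-running an analysis never overwrites an older result.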

  11. Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

  12. Not All Experimental Designs Are Simple: Alternative Questions. Are a large number of groups different? Do two things synergize? Are mRNA levels correlated to something? Are mRNA levels associated with survival? Can we use mRNA levels to predict things?

  13. General Linear Modeling. The underlying mathematical framework for most statistical techniques we are familiar with: ANOVAs, Logistic regression, Linear regression, Multiple regression. $Y = a_0 + a_1 x_1 + a_2 x_2 + \dots$ NOT the same as a “Generalized Linear Model”!!!

  14. General Linear Modeling: Special Cases. Linear Regression ($x_1$ continuous): $Y = a_0 + a_1 x_1$. Logistic Regression ($Y$ factorial): $Y = a_0 + a_1 x_1$. Multiple Regression ($x_1$, $x_2$ continuous): $Y = a_0 + a_1 x_1 + a_2 x_2$

  15. ANOVAs. 1-way ANOVA ($x_1$ factorial): $Y = a_0 + a_1 x_1$. 2-way ANOVA ($x_1$, $x_2$ two-level factors): $Y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2$

  16. ANOVA Experimental Designs Are Common. Classic one-way ANOVAs: Treat a cell-line with 5 drugs: do any of them make a difference? Make 5 different genetic mutations: do any of them alter gene-expression? H0: all group means are equal. HA: the mean of at least one group differs. Guesses at the assumptions?

  17. Assumptions Are Similar to T-test Normal distribution for the dependent variable Samples are independent Homoscedasticity Independent variables are: Not correlated Random normal variables

  18. 1-Way ANOVAs in R (Part 1)

      # load the affy package, which provides ReadAffy() and expresso()
      library(affy);
      # read the data normally
      raw.data <- ReadAffy();
      eset <- expresso(…);
      # localize for readability
      expression.matrix <- exprs(eset);
      # have a list of groups
      groups <- as.factor( pData(eset)$x );

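  19. 1-Way ANOVAs in R (Part 2)

      # store one p-value per gene
      pvalues <- rep(NA, nrow(expression.matrix));
      # loop over each gene
      for (i in 1:nrow(expression.matrix)) {
          # fit a one-way anova against the group factor
          tmp <- aov(expression.matrix[i,] ~ groups);
          # extract p-value (first row, fifth column of the ANOVA table)
          pvalues[i] <- summary(tmp)[[1]][1,5];
          }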

  20. But This Is Limited. A 1-way ANOVA just says that one group differs. Which one? → post hoc tests. No microarray-specific aspects here. Note: connection to multiple-testing
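
The two follow-ups named on this slide, post hoc tests and multiple-testing adjustment, can be sketched in base R. Everything here is an assumption for illustration: the data are simulated, and Tukey's HSD and the Benjamini-Hochberg adjustment are common choices rather than methods the slides prescribe.

```r
set.seed(1)

# simulated stand-in for an expression matrix: 100 genes x 15 samples
groups <- factor(rep(c("A", "B", "C"), each = 5))
expression.matrix <- matrix(rnorm(100 * 15), nrow = 100)

# make gene 1 truly different in group C
expression.matrix[1, groups == "C"] <- expression.matrix[1, groups == "C"] + 5

# per-gene one-way ANOVA p-values (row 1, column 5 of the ANOVA table is Pr(>F))
pvalues <- apply(
    expression.matrix,
    1,
    function(y) summary(aov(y ~ groups))[[1]][1, 5]
    )

# adjust for having tested 100 genes (Benjamini-Hochberg false-discovery rate)
qvalues <- p.adjust(pvalues, method = "fdr")

# post hoc test on gene 1: which pairs of groups differ?
gene1 <- expression.matrix[1, ]
posthoc <- TukeyHSD(aov(gene1 ~ groups))
print(posthoc$groups)
```

The ANOVA p-value only says that some group differs for gene 1; the Tukey table reports each pairwise difference (B-A, C-A, C-B) with its own adjusted p-value.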

  21. Sometimes 1-Way ANOVAs Are Not Worth the Effort. [Diagram: Wildtype, Mutation 1, Mutation 2] 1-way ANOVA + post hoc? Or 2 t-tests?

  22. Not Always Testing Raw Data. [Diagram: Vehicle 1 / Drug 1, Vehicle 2 / Drug 2, Vehicle 3 / Drug 3] 3 drugs with different controls: 1-way ANOVA on the fold-changes

  23. Two-Way ANOVAs. Probably even more common than one-way ANOVAs. Very powerful: Synergy? Additivity? Antagonism? $Y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2$ Assumptions?

  24. Assumptions Are Similar to 1-Way ANOVA Normal distribution for the dependent variable Samples are independent Homoscedasticity Independent variables are: Not correlated Random normal variables

  25. Do these treatments interact? Standard approach: ANOVA. [Figure labels: Treatment #1, Treatment #2, Interaction]

  26. Example: Radiation Toxicity. Some people are prone to late-stage radio-toxicity. Does radiation induce specific patterns of gene-expression in these people? [Figure: experimental design crossing Radiation (0 Gy, 3 Gy) with Radio-Sensitive status]

  27. Solution Fit an ANOVA model to each gene Most effects are due to radiation alone, minimal interaction

  28. Two-Way ANOVAs in R The limma package is one very good approach for this Alternatively standard model-fitting using the lm() function can be done for each gene We will cover each approach in Tutorial #3
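
The per-gene `lm()` approach mentioned on this slide can be sketched as follows. The design (radiation dose crossed with radio-sensitivity, four replicates per cell), the simulated expression values and the helper `coefficient.p` are all assumptions for illustration; this is not the tutorial's code and does not use limma.

```r
set.seed(2)

# hypothetical 2x2 design with 4 replicates per cell (16 samples)
radiation <- factor(rep(c("0Gy", "3Gy"), each = 8))
sensitive <- factor(rep(rep(c("no", "yes"), each = 4), times = 2))

# simulated genes: one responds to radiation only,
# one responds only when radiation and sensitivity occur together
gene.additive <- 8 + 2 * (radiation == "3Gy") + rnorm(16, sd = 0.3)
gene.interact <- 8 + 3 * (radiation == "3Gy" & sensitive == "yes") + rnorm(16, sd = 0.3)

# fit Y = a0 + a1*x1 + a2*x2 + a3*x1*x2 and return the p-value of one term
coefficient.p <- function(y, term) {
    fit <- lm(y ~ radiation * sensitive)
    summary(fit)$coefficients[term, "Pr(>|t|)"]
    }

radiation.p <- coefficient.p(gene.additive, "radiation3Gy")
interaction.p <- coefficient.p(gene.interact, "radiation3Gy:sensitiveyes")
print(c(radiation = radiation.p, interaction = interaction.p))
```

In a real analysis the same function would be applied to every row of the expression matrix, followed by a multiple-testing adjustment of the resulting p-values.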

  29. Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

  30. So Far We Have Considered Exactly-Defined Groups. Six cell-lines with differential sensitivity to a drug (80%, 45%, 15%, 30%, 85%, 70%). What genes are associated with this phenomenon?

  31. Two Basic Approaches. Correlation metrics: Correlations, Mutual Information. Fit linear models with continuous variables: $Y = a_0 + a_1 x_1$, where $Y$ = mRNA abundance, $a_0$ = basal level, $a_1$ = effect of drug, $x_1$ = drug sensitivity
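
Fitting the linear model from this slide to one gene might look like the sketch below. The six sensitivities are the percentages shown on slide 30; the expression values are simulated, and the basal level and slope used to simulate them are assumptions.

```r
set.seed(3)

# drug sensitivity (%) of the six cell-lines from slide 30
sensitivity <- c(80, 45, 15, 30, 85, 70)

# simulated mRNA abundance for one gene (hypothetical basal level and slope)
expression <- 6 + 0.03 * sensitivity + rnorm(6, sd = 0.2)

# fit Y = a0 + a1 * x1
fit <- lm(expression ~ sensitivity)
a0 <- coef(fit)[["(Intercept)"]]
a1 <- coef(fit)[["sensitivity"]]
slope.p <- summary(fit)$coefficients["sensitivity", "Pr(>|t|)"]

# the correlation-based approach gives a consistent answer:
# for a single predictor, R-squared equals the squared Pearson correlation
r <- cor(expression, sensitivity)
print(c(a0 = a0, a1 = a1, p = slope.p, r = r))
```

With a single continuous predictor the two approaches on the slide agree, which is why the test on the slope and the test on the correlation lead to the same conclusion.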

  32. Correlation Basics. Start from the beginning, univariate statistics: Variance = $\mathrm{Var}(X) = E[(X - \mu_X)^2]$; Standard Deviation = $[\mathrm{Var}(X)]^{0.5}$. But if you have two variables, how are they related? Covariance = $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$. Correlation is a scaled form of the covariance

  33. Basic Properties of Correlations. Unit-less: variance and covariance have squared units; standard deviation has normal units. Range [-1.0, 1.0]: range is independent of sample-size; range is independent of the range of X and Y. Captures the degree to which two variables change together

  34. Relationship Types. Correlation > 0: variables positively correlated; when one goes up, the other one tends to as well. Correlation < 0: variables negatively (inversely) correlated; when one goes up, the other tends to go down. Correlation = 0: no linear relationship. NB: if variables are independent, then correlation = covariance = 0. NB: if correlation = covariance = 0, variables may or may not be independent

  35. Pearson’s Correlation. Most common correlation metric, R. Measures linear relationship between two variables. $R = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)$

  36. Pearson’s R Cannot Capture Non-Linear Relationships Correctly
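
Both points can be checked numerically: Pearson's R is the covariance scaled by the two standard deviations, and it misses a relationship that is perfectly deterministic but not linear. The helper name `pearson` and the toy vectors are assumptions for illustration.

```r
# symmetric x values so the quadratic case is easy to reason about
x <- c(-3, -2, -1, 0, 1, 2, 3)
y.linear <- 2 * x + 1
y.quadratic <- x^2

# R = Cov(X, Y) / (sd(X) * sd(Y))
pearson <- function(x, y) {
    cov(x, y) / (sd(x) * sd(y))
    }

r.linear <- pearson(x, y.linear)
r.quadratic <- pearson(x, y.quadratic)
print(c(linear = r.linear, quadratic = r.quadratic))
```

The linear case gives R = 1, while the quadratic case gives R = 0 even though y is completely determined by x. The hand-written formula matches R's built-in `cor()`.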

  37. Spearman’s Rank-Order Correlation. Second most-common correlation, ρ (Greek rho). Makes no assumption of a linear relationship between the variables. Simplified version of Pearson’s R: works directly on ranks. $d_i = x_i - y_i$ (the differences between ranks); $\rho = 1 - 6 \sum d_i^2 / [n (n^2 - 1)]$
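
The rank-difference formula can be worked through on a small made-up dataset and compared with R's built-in Spearman correlation. The values of `x` and `y` are hypothetical, and the shortcut formula assumes there are no tied ranks.

```r
# hypothetical paired measurements without ties
x <- c(10, 20, 30, 40, 50)
y <- c(1.2, 0.9, 3.4, 2.8, 7.1)

# convert to ranks and take the per-sample rank differences
d <- rank(x) - rank(y)
n <- length(x)

# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho.formula <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1))

# built-in equivalent
rho.builtin <- cor(x, y, method = "spearman")
print(c(formula = rho.formula, builtin = rho.builtin))
```

Here the ranks of y are (2, 1, 4, 3, 5), so the squared differences sum to 4 and ρ = 1 - 24/120 = 0.8 by either route.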

  38. Spearman Example

  39. Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

  40. Survival Analysis A major new area in microarray analysis Works with any right-censored data Censoring: the value is only partially known Right-censoring: the value is at least this large Final outcome is not known: Patients are still alive at the time of the analysis An adverse drug-reaction has not happened yet Standard statistical approaches in use

  41. Typical Survival Curve

  42. Key Survival Statistics. Cox proportional hazards model: HR = hazard ratio; P = p-value for the null hypothesis that the hazard ratio is 1.0. Log-rank test: p-value for the null hypothesis that two survival curves do not differ
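
Both statistics can be obtained with the `survival` package, which is distributed with most R installations (an assumption about the reader's setup). The patients, the two expression groups and the true hazard ratio of 2 are simulated for illustration only.

```r
library(survival)
set.seed(4)

n <- 100
# hypothetical grouping: 0 = low expression of a gene, 1 = high expression
high.expression <- rep(c(0, 1), each = n / 2)

# simulated event times: the hazard is doubled in the high-expression group
event.time <- rexp(n, rate = 0.1 * ifelse(high.expression == 1, 2, 1))
# independent right-censoring (e.g. end of follow-up)
censor.time <- rexp(n, rate = 0.03)

observed.time <- pmin(event.time, censor.time)
status <- as.numeric(event.time <= censor.time)   # 1 = event seen, 0 = censored

# Cox proportional hazards model: hazard ratio and its p-value
fit <- coxph(Surv(observed.time, status) ~ high.expression)
hazard.ratio <- exp(coef(fit))[["high.expression"]]
cox.p <- summary(fit)$coefficients["high.expression", "Pr(>|z|)"]

# log-rank test comparing the two survival curves
logrank <- survdiff(Surv(observed.time, status) ~ high.expression)
logrank.p <- pchisq(logrank$chisq, df = 1, lower.tail = FALSE)

print(c(HR = hazard.ratio, cox.p = cox.p, logrank.p = logrank.p))
```

For a per-gene analysis like the one on the following slides, the same model would be fitted once per gene and the p-values adjusted for multiple testing.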

  43. Example Beer and coworkers studied non-small cell lung cancer using an older Affymetrix microarray: 12 samples of normal lung 83 samples of non-small cell lung cancer ~10,000 genes on their array Two questions: How many genes are associated with tumour-initiation? How many genes are associated with tumour-progression?

  44. Tumour Initiation: per-gene t-tests More genes repressed Fewer oncogenes?!

  45. Tumour Progression: per gene Cox models. More genes are involved in helping a tumour resist treatment and grow larger than in “making” it in the first place! [Figure labels: Progression, Initiation] P < 0.05: 733 genes, 230 genes. P < 0.01: 63 genes, 136 genes. P < 0.001: 2 genes, 15 genes

  46. Warning. There are several assumptions to a Cox model. Semi-parametric: no assumptions are made about the “baseline hazard”. Censoring must be independent of events: you shouldn’t be more likely to lose follow-up information on patients who die. Hazards must be proportional: the hazard ratio does not change across time. In general you want to have a statistician around to ensure you are doing survival analyses correctly.
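
The proportional-hazards assumption can be examined with `cox.zph()` from the `survival` package. This sketch uses the `lung` example dataset that ships with that package, which is unrelated to the study discussed in these slides; the choice of covariates is arbitrary.

```r
library(survival)

# example data bundled with the survival package (not the slides' dataset)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# test whether each covariate's effect changes over time;
# small p-values suggest the proportional-hazards assumption is violated
ph.check <- cox.zph(fit)
print(ph.check$table)
```

The table has one row per covariate plus a GLOBAL row. Calling `plot(ph.check)` shows the estimated effect over time, which is often more informative than the p-value alone.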

  47. Outline Data Organization & Storage Two-Level Experimental Designs Continuous Variables Survival Analysis Meta-Analysis Machine-Learning

  48. Meta-Analysis. Combining results of multiple studies that study related hypotheses. Often used to merge data from different microarray platforms. Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data

  49. Why Do Meta-Analysis? Can identify publication biases Appropriately weights diverse studies Sample-size Experimental-reliability Similarity of study-specific hypotheses to the overall one Increases statistical power Reduces information A single meta-analysis vs. five large studies Provides clearer guidance
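
One common way to weight studies is fixed-effect inverse-variance pooling, sketched below. This is a swapped-in illustration, not a method named on the slides, and the per-study effect sizes and standard errors are invented.

```r
# hypothetical per-study estimates for one gene (e.g. log2 fold-changes)
effects <- c(0.80, 1.10, 0.40, 0.95)
# hypothetical standard errors; larger studies tend to have smaller ones
std.errors <- c(0.40, 0.25, 0.50, 0.30)

# weight each study by the inverse of its variance
weights <- 1 / std.errors^2

pooled.effect <- sum(weights * effects) / sum(weights)
pooled.se <- sqrt(1 / sum(weights))

# two-sided z-test of the pooled effect against zero
z <- pooled.effect / pooled.se
pooled.p <- 2 * pnorm(-abs(z))

print(c(effect = pooled.effect, se = pooled.se, p = pooled.p))
```

The pooled standard error is always smaller than that of the most precise single study, which is the sense in which a meta-analysis increases statistical power.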

  50. Challenges of Meta-Analysis No control for bias What happens if most studies are poorly designed? File-drawer problem Publication bias can be detected, but not explicitly controlled for How homogeneous is the data? Can it be fairly grouped? Simpson’s Paradox
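
Simpson's Paradox is easy to reproduce with two made-up studies: within each study the variables rise together, yet naively pooling the samples reverses the sign of the correlation. All values below are hypothetical.

```r
# two hypothetical studies measured on different baselines
study <- factor(rep(c("A", "B"), each = 5))
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(10, 11, 12, 13, 14, 3, 4, 5, 6, 7)

# within each study the relationship is perfectly positive
within.A <- cor(x[study == "A"], y[study == "A"])
within.B <- cor(x[study == "B"], y[study == "B"])

# pooling the raw data while ignoring the study reverses the direction
pooled <- cor(x, y)

print(c(A = within.A, B = within.B, pooled = pooled))
```

This is one reason to ask whether datasets can be fairly grouped before merging them, and to model the study as a factor rather than pooling raw values.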
