Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Lecture 8
Microarrays II: Data Analysis
MBP1010
Dr. Paul C. Boutros
Winter 2014
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.
† Image: Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions

Topics For This Week
• Examples
• Attendance
• Pre-Processing
• QA/QC
• Microarray-Specific Statistics
• ProbeSet remapping
• Organizing –omics studies

Example #1
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour.
Your data:
OG (cm³): 5.2 1.9 5.0 6.1 4.5 4.8
TS (cm³): 3.9 7.1 3.1 4.4 5.0
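One possible way to approach Example #1 is an exact permutation test on the difference in group means, which avoids distributional assumptions with so few animals. The sketch below is only an illustration and not necessarily the intended answer to the exercise; the function name `exact_permutation_p` is hypothetical, and the choice of the difference in means as the test statistic is an assumption.

```python
from itertools import combinations
from statistics import mean

# Tumour volumes (cm3) copied from Example #1.
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]
ts = [3.9, 7.1, 3.1, 4.4, 5.0]


def exact_permutation_p(smaller, larger):
    """One-sided exact permutation test of mean(smaller) < mean(larger).

    Returns the observed difference in means and the fraction of all
    possible relabellings whose difference is at least as negative.
    """
    pooled = smaller + larger
    n_small = len(smaller)
    n_large = len(larger)
    total = sum(pooled)
    observed = mean(smaller) - mean(larger)

    n_splits = 0
    n_as_extreme = 0
    # Enumerate every way of choosing which animals are labelled "smaller".
    for idx in combinations(range(len(pooled)), n_small):
        s = sum(pooled[i] for i in idx)
        diff = s / n_small - (total - s) / n_large
        n_splits += 1
        # Small tolerance guards against floating-point ties.
        if diff <= observed + 1e-12:
            n_as_extreme += 1
    return observed, n_as_extreme / n_splits


observed_diff, p_value = exact_permutation_p(ts, og)
print(f"mean(TS) - mean(OG) = {observed_diff:.3f} cm3, one-sided p = {p_value:.3f}")
```

With 11 animals there are only 462 possible relabellings, so full enumeration is cheap; for larger groups a random sample of relabellings would be the usual substitute.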

Example #2
You are conducting a study of osteosarcomas using mouse models. You are studying transgenic animals with deletion of a tumour suppressor (TS), or with amplification of an oncogene (OG). You consider the penetrance of tumours in a set of 8 different mouse strains.
Your hypothesis: some mouse strains lead to bigger tumours than others when OG is amplified, considering only animals in which tumours form. You measure tumour volume in mm³ using calipers.
Strain 1 (mm³): 91 69 83
Strain 2 (mm³): 201 70 71
Strain 3 (mm³): 15 36 20
Strain 4 (mm³): 52 52 53
Strain 5 (weeks): 11 538 59
Strain 6 (mm³): 6 60 63
Strain 7 (mm³): 85 79 70
Strain 8 (mm³): 100 105 121

Example #3
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than wildtype animals, as assessed by molecular imaging:
TS (imaging response): Yes No Yes Yes No
WT (imaging response): Yes Yes Yes Yes No Yes
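For a two-group Yes/No outcome with counts this small, Fisher's exact test is one commonly used option. The sketch below computes the one-sided p-value directly from the hypergeometric distribution using only the standard library; the function name `fisher_one_sided_less` is hypothetical, and treating the question as one-sided (TS responds less often than WT) is an assumption taken from the wording of the hypothesis.

```python
from math import comb

# Imaging responses copied from Example #3.
ts_response = ["Yes", "No", "Yes", "Yes", "No"]
wt_response = ["Yes", "Yes", "Yes", "Yes", "No", "Yes"]


def fisher_one_sided_less(group_a, group_b, positive="Yes"):
    """P(group_a has this many or fewer positives) with all margins fixed."""
    a_pos = group_a.count(positive)
    total_pos = a_pos + group_b.count(positive)
    n_a = len(group_a)
    n_total = n_a + len(group_b)
    total_neg = n_total - total_pos

    # Smallest number of positives group_a could contain given the margins.
    k_min = max(0, n_a - total_neg)
    favourable = sum(
        comb(total_pos, k) * comb(total_neg, n_a - k)
        for k in range(k_min, a_pos + 1)
    )
    return favourable / comb(n_total, n_a)


p_value = fisher_one_sided_less(ts_response, wt_response)
print(f"one-sided Fisher exact p = {p_value:.4f}")
```

The 2x2 table here is 3 Yes / 2 No for TS against 5 Yes / 1 No for WT, so the p-value is the probability of seeing 3 or fewer responders among the 5 TS animals when 8 of the 11 animals respond overall.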

Example #4
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG).
Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:
OG (DX-responsive genes): MYC KRAS CD53 CDH1 MUC1 MARCH1 PTEN IDH3 ESR2 RHEB CTCF STK11 MLL3 KEAP1 NFE2L2 ARID1A
TS (DX-responsive genes): MYC KRAS CD53 CDH1 FBW1 SEPT7 MUC1 MUC3 MUC9 RNF3
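One way to ask whether the two gene lists in Example #4 overlap more than expected by chance is a hypergeometric tail probability. The sketch below is a minimal illustration under stated assumptions: all 20,000 genes are treated as equally likely to appear on either list, and the function name `overlap_p_value` is hypothetical.

```python
from math import comb

N_GENES = 20_000  # genes on the array, as stated in Example #4

og_genes = {"MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3",
            "ESR2", "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A"}
ts_genes = {"MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7", "MUC1", "MUC3",
            "MUC9", "RNF3"}


def overlap_p_value(n_universe, n_a, n_b, n_overlap):
    """P(overlap >= n_overlap) when list B is a random draw of n_b genes."""
    upper = min(n_a, n_b)
    # Exact integer arithmetic; only the final ratio is converted to a float.
    favourable = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_b)


overlap = og_genes & ts_genes
p_value = overlap_p_value(N_GENES, len(og_genes), len(ts_genes), len(overlap))
print(f"{len(overlap)} shared genes: {sorted(overlap)}; p = {p_value:.3g}")
```

Python sets make the overlap itself a one-line operation. Gene symbols are compared as plain strings, so any symbol altered on import (for example by a spreadsheet) would silently drop out of the intersection.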

Example #5
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice naturally susceptible to these tumours at ~20% penetrance. You are studying two transgenic lines, one with deletion of a tumour suppressor (TS), the other with amplification of an oncogene (OG). Tumour penetrance in these is 100%.
Your hypothesis: you now wonder whether tumour size varies with the age of the animal. You suspect that tumour size differs between lines, but that the comparison is confounded by age differences.
Your data:
OG (cm³): 5.2 (17 weeks), 1.9 (9 weeks), 5.0 (15 weeks), 6.1 (15 weeks), 4.5 (21 weeks), 4.8 (20 weeks)
Wildtype (cm³): 1.1 (9 weeks), 1.5 (10 weeks), 2.1 (15 weeks), 2.5 (15 weeks), 0.3 (17 weeks), 2.2 (21 weeks)
TS (cm³): 3.9 (17 weeks), 7.1 (15 weeks), 3.1 (15 weeks), 4.4 (22 weeks), 5.0 (22 weeks)

Example #6
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: mice lacking TS will acquire tumours sooner than wildtype mice. You test the mice weekly using ultrasound imaging.
Your data:
TS (week of tumour): 4 7 7 6 5
OG (week of tumour): 3 9 3 2 4 3

Summary Point #1: Microarray data is analyzed with a pipeline of sequential algorithms. This pipeline defines the standard workflow for microarray experiments.

[Figure: generic microarray analysis workflow diagram. Labels shown: Spot, Cy3, Cy5, Quantitation, Background, Spot Quality, Intra-Array, Inter-Array, Significance Testing, Clustering, Integration, Spot List, ?]

Summary Point #2: This is an active research area.

Summary Point #3: These basic steps hold true for all microarray platforms and types.

What Is BioConductor? “Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data.” - BioConductor website The vast majority of our analyses will use BioConductor code, but there are clearly non-BioConductor approaches.

I’ve outlined the general workflow. Each technology and application has its own unique characteristics to consider.

Let’s Define an Affymetrix-Specific Workflow

[Figure: the generic workflow diagram (Spot, Cy3, Cy5, Quantitation, Background, Spot Quality, Intra-Array, Inter-Array, Significance Testing, Clustering, Integration, Spot List, ?) annotated for Affymetrix arrays. Annotations: “Quantitation is done according to Affymetrix defaults with minimal user intervention.”; “One-Channel array”; “Single-Channel array, so one simultaneous normalization procedure”; “Typically ignored”.]

Let’s Collapse This a Bit And Re-Phrase Things

[Figure: collapsed Affymetrix workflow diagram. Labels shown: .CEL Files, Background, Normalization, ProbeSet, Annotation, Statistics, Clustering, Integration, Spot List, ?]

First Let’s Go Back to Pre-Processing
What exactly is pre-processing (aka normalization)?
Why do we do it?

Sources of Technical Noise Where does technical noise come from?

More Sources of Technical Noise

Any step in the experimental pipeline can introduce artifactual noise:
• Array design
• Array manufacturing
• Sample quality
• Sample identity → sequence effects?
• Sample processing
• Hybridization conditions → ozone?
• Scanner settings
Pre-processing tries to remove these systematic effects.

Important Note
Pre-processing is never a substitute for good experimental design. This is not a course on statistical design, but a few basic principles should be mentioned:
• Biological replicates are preferable to technical replicates.
• Always try to balance experimental groups.
• If processing samples identically is not possible, include controls for processing-effects.

Affymetrix Pre-Processing Steps
• Background Correction
• Normalization
• Probe-Specific Adjustment
• Summarizing multiple Probes into a single ProbeSet
Let’s look at two common approaches.

Introducing Two Major Affymetrix Pre-Processing Methods
The two most commonly used methods are:
• RMA = Robust Multi-array Average
• MAS5 = Microarray Suite version 5
MAS5 has strengths & weaknesses:
• Sacrifices precision for accuracy
• Can easily be used in clinical settings
RMA has strengths & weaknesses:
• Sacrifices accuracy for precision
• Challenging to integrate multiple studies
• Reduces variance (critical for small-n studies)
Both are well accepted by journals and reviewers, RMA perhaps a bit more so. We’ll talk about some of the mathematics later on in this course.

Approach #1: MAS5 Affymetrix put significant effort into developing good data pre-processing approaches MAS5 was an attempt to develop a “standard” technique for 3’ expression arrays The flaws of MAS5 led to an influx of research in this area. The algorithm is best-described in an Affymetrix white-paper, and is actually quite challenging to reproduce exactly in R.

MAS5 Model
Observations = True Signal + Random Noise + Probe Effects
Assumptions?

What is RMA?
RMA = Robust Multi-Array Average
Why do we use a “robust” method? Robust summaries improve on the standard ones by down-weighting outliers and leaving their effects visible in the residuals.
Why do we use “multi-array”? To put each chip’s values in the context of a set of similar values.

What is RMA?
• It is a log-scale linear additive model
• Assumes all the chips have the same background distribution
• Does not use the mismatch probe (MM) data from the microarray experiments
Why?

What is RMA? Mismatch probes (MM) definitely have information - about both signal and noise - but using it without adding more noise is a challenge We should be able to improve the background correction using MM, without having the noise level blow up: topic of current research (GCRMA) Ignoring MM decreases accuracy but increases precision

Methodology
Quantile normalization: the goal of this method is to give every array in a set of arrays the same distribution of probe intensities. It is motivated by the Q-Q plot: if two data vectors have the same distribution, their quantiles plotted against each other fall on the straight diagonal line, and any departure from that line indicates a difference in distribution.
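The idea of forcing every array onto one shared distribution can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular BioConductor package: the function name `quantile_normalize` and the toy intensities are hypothetical, and ties are simply broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Returns new lists in the original probe order.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # 1. Sort each array independently.
    sorted_arrays = [sorted(a) for a in arrays]

    # 2. Average across arrays at each rank to build the reference distribution.
    reference = [
        sum(col[rank] for col in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]

    # 3. Replace each value by the reference value for its rank within its array.
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three arrays with four probes each.
toy = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(toy)
for column in normalized:
    print(column)
```

After normalization every array contains exactly the same set of values, so a Q-Q plot of any two arrays would lie on the diagonal; only the assignment of values to probes differs between arrays.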

Methodology
Summarization: combining the multiple probe intensities of each probeset to produce a single expression value.
An additive linear model is fit to the normalized data to obtain an expression measure for each probeset on each GeneChip:
$Y_{ij} = a_j + \beta_i + \varepsilon_{ij}$

Methodology
$Y_{ij} = a_j + \beta_i + \varepsilon_{ij}$
• $Y_{ij}$ denotes the background-corrected, normalized probe value for the $i$th GeneChip and the $j$th probe within the probeset, $\log_2(PM - BG)^{*}_{ij}$
• $a_j$ is the probe affinity of the $j$th probe
• $\beta_i$ is the chip effect for the $i$th GeneChip (log-scale expression level)
• $\varepsilon_{ij}$ is the random error term

Methodology
$Y_{ij} = a_j + \beta_i + \varepsilon_{ij}$
• Estimate $a_j$ (probe affinity) and $\beta_i$ (chip effect) using a robust method:
• Tukey’s median polish (quick): fits iteratively, successively removing row and column medians and accumulating the terms, until the process stabilizes. The residuals are what is left at the end.
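The median polish described above can be sketched for a single probeset as follows. This is an illustration under stated assumptions rather than the code used by any RMA implementation: rows are taken to be chips and columns to be probes, the function name `median_polish`, the iteration limit and the tolerance are hypothetical choices, and the toy matrix is invented, with one deliberately inflated value.

```python
from statistics import median


def median_polish(y, max_iter=10, tol=1e-6):
    """Tukey's median polish: y[i][j] = overall + row[i] + col[j] + resid[i][j]."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Remove row medians and accumulate them in the row effects.
        row_meds = [median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [v - row_meds[i] for v in resid[i]]
            row_eff[i] += row_meds[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift

        # Remove column medians and accumulate them in the column effects.
        col_meds = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [resid[i][j] - col_meds[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_meds)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift

        # Stop once a full sweep no longer changes anything noticeably.
        if max(abs(m) for m in row_meds + col_meds) < tol:
            break
    return overall, row_eff, col_eff, resid


# Hypothetical log2 probe values: 3 chips (rows) x 3 probes (columns).
# The value 14.0 stands in for one outlying probe on the first chip.
toy = [[7.0, 8.0, 14.0], [8.0, 9.0, 10.0], [9.0, 10.0, 11.0]]
overall, chip_effect, probe_affinity, residuals = median_polish(toy)

# Per-chip expression estimate: overall level plus the chip effect.
expression = [overall + b for b in chip_effect]
print(expression)
print(residuals)
```

In this toy matrix the inflated value ends up in the residuals rather than shifting the estimate for its chip, which is the behaviour the "robust" label refers to. A mean-based fit would instead spread the outlier across the row and column effects.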

RMA vs. MAS5
• RMA sacrifices accuracy for precision
• RMA is generally not appropriate for clinical settings
• RMA provides higher sensitivity/specificity in some tests
• RMA reduces variance (critical for small-n studies)
• RMA is better accepted by journals and reviewers
