Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Module 8: Gene Expression Profiling
Paul Boutros
Bioinformatics for Cancer Genomics, May 26-30, 2014

Course Overview
08:30 - 10:45  Expression Profiling in Cancer Genomics; Microarray Pre-Processing Basics
11:15 - 12:30  Guided Analysis of a Microarray Study

Learning Objectives of Module
• Understand the types of microarrays that exist
• Identify the sources of noise in a microarray experiment
• Appreciate the complete microarray analysis pipeline
• Learn to input raw microarray data into R/BioConductor
• Learn to pre-process raw microarray data
• Perform standard statistical analyses on microarray data

Let’s start off with a question: What do expression microarrays actually measure?

Session Overview
• What are microarrays?
• What are microarrays used for? Molecular Aspects; Biological Aspects; Downstream Analyses
• How is microarray data analyzed? Workflow overview

What is a Microarray?
“A DNA microarray is a multiplex technology consisting of thousands of oligonucleotide spots, each containing picomoles of a specific DNA sequence.”
• Used to quantitate mRNA or DNA
• Many applications: mRNA or DNA levels; SNP identification; ChIP-on-Chip

Hypotheses
• Microarrays are usually hypothesis-generating: they highlight specific genes or features that are particularly interesting for follow-up experiments
• There are many interesting exceptions: biomarkers, pathway analyses
• This does not reduce the importance of experimental design: the low statistical power of array studies makes good design even more important and very challenging

Input Samples
The nature of the sample is critical:
* Unfrozen vs. Frozen vs. FFPE
* Total RNA vs. poly-A RNA vs. other subsets

Microarray Basics
Imagine a one-spot microarray: target DNA is labeled, hybridized, and washed. Finally, scan the chip.
[Figure labels: Target, Chip, Feature, Probe]

These Are Spotted Arrays
Printed onto a series of glass slides by a robot with needle-heads. They produce a characteristic gridding pattern and almost always use two samples simultaneously (two-colour).

Other Types of Arrays
• Inkjet arrays
• Photolithographically generated arrays
• Bead arrays
• Protein/cell/lipid arrays: more “niche” applications, not discussed here

Inkjet Arrays
In 1999, HP spun off its life-science and measurement division into Agilent Technologies. The new company wanted to determine if printer technology could be harnessed to generate microarrays.

Inkjet Array Manufacture Involves Sequential Nucleotide Addition

Photolithographic Arrays
• Produced using the techniques developed for the production of transistors
• Mostly pioneered by the company Affymetrix, although other suppliers exist (e.g. Nimblegen)
• We will be working with Affymetrix data later, so we will walk through the platform in significant detail

The Glass Matrix
[Figure labels: Silanization; Addition of linker molecule]

Photolithographic Synthesis Photolithographic mask

Deprotection

Nucleotide Addition

Capping Agents

Final Chip Wafer Feature Chip

RNA Wash

An Affymetrix Microarray

Self-Assembling Bead-Arrays
• Produced by Illumina
• 3 μm silica beads, randomly placed
• Coated with ~10^5 identical 25 bp probes
• Probes have identifying barcode (address) sequences
[Figure labels: labeled cDNA, bead, address, probe]

Comparing Array Platforms

Platform       Price   Oligos     Data Quality   Bioinformatics Research
Spotted cDNA   $       variable   +              +++
Affymetrix     $$$     25 bp      +++            +++
Inkjet         $$      ~70 bp     ++             ++
Bead Arrays    $$      ~25 bp     ++             +

I do not endorse specific platforms – they all have their strengths and weaknesses

Session Overview
• What are microarrays?
• What are microarrays used for? Molecular Aspects; Biological Aspects; Downstream Analyses
• How is microarray data analyzed? Workflow overview

What Are Microarrays Used For? Molecular
RNA
• mRNA abundances
• Splicing (quantitate different isoforms)
• mRNA degradation rates (half-life)
• mRNA translation rates
• RNA capture (RIP)
DNA
• DNA sequence (SNPs)
• DNA copy-number
• DNA capture (exome, ChIP)
• Tag quantitation (genetic screening)
Other
• Protein arrays
• Cell-based arrays
• Lipid arrays

What Are Microarrays Used For? Biological
RNA: mRNA abundances; splicing (quantitate different isoforms); mRNA degradation rates (half-life); mRNA translation rates; RNA capture (RIP)
* Candidate Gene Identification
* Pathway Analysis
* Model Characterization
* Classifiers/Predictive Models
* Drug-Analysis (Dose/Time/Class)
* Integration Analysis

Session Overview
• What are microarrays?
• What are microarrays used for? Molecular Aspects; Biological Aspects; Downstream Analyses (upcoming sessions: pathways & clinical integration)
• How is microarray data analyzed? Workflow overview

[Figure: analysis workflow. Each spot is a probe, measured in the Cy3 and Cy5 channels. A) Remove Noise: Quantitation, Background, Spot Quality, Intra-Array, Inter-array. B) Extract Data: Significance Testing, Clustering, Integration, Spot List.]

Step #1: Image Quantitation
Why? Quantitative vs. Qualitative
How? Image Segmentation
Difficulty? +++
Research? +

Image Segmentation 101: Find Grids
1. Find Grids
2. Find Spots
3. Spot Outline

Image Segmentation 101: Find Spots
Key Step: Integrate Signal Across Array
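One plausible reading of "integrate signal across array" is a projection profile: sum the pixel intensities along each row and each column, then look for peaks that mark spot rows and spot columns. The sketch below illustrates that idea on a tiny made-up image. The function names (`projection_profiles`, `local_maxima`) and the pixel values are hypothetical, and real image-analysis software is considerably more robust than this.

```python
def projection_profiles(image):
    """Sum pixel intensities along every row and every column."""
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return row_sums, col_sums


def local_maxima(profile):
    """Return indices whose value is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(profile) - 1)
        if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]
    ]


# Hypothetical 6 x 7 pixel image with four bright single-pixel "spots".
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 8, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 7, 0, 0, 0, 9, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

rows, cols = projection_profiles(image)
spot_rows = local_maxima(rows)  # candidate grid rows
spot_cols = local_maxima(cols)  # candidate grid columns
print(spot_rows, spot_cols)
```

Crossing the candidate rows with the candidate columns gives approximate spot centres, which a later step would refine into spot outlines.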

Image Segmentation 101: Challenges
Problems:
• Stray Signal
• Missing Spots
• Gross Deformities
Manual Validation

Research?
• Surprisingly, not much investigation
• This is probably a source of error in all studies
• Manual checking of spot-detection remains the norm
• Problematic as studies & arrays get larger


Step #2: Background Correction
Why? Remove Stray Signal
How? Model-based
Difficulty? ++++
Research? ++

Spot Segmentation
[Figure labels: Signal, ???, Background]

So what do we get?
Foreground Intensity: FG
Background Intensity: BG
Isn’t it simple? Signal = FG - BG
NO! If BG > FG, the signal is negative (-ve). This happens for 0.1-2% of spots.
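To make the problem concrete, here is a minimal sketch of the naive subtraction and a count of how many spots end up negative. The intensities are invented for illustration, so the negative fraction here does not reflect the 0.1-2% figure. The function names are hypothetical.

```python
def subtract_background(fg, bg):
    """Naive correction: signal = FG - BG for each spot."""
    return [f - b for f, b in zip(fg, bg)]


def fraction_negative(signal):
    """Fraction of spots whose corrected signal is below zero."""
    return sum(1 for s in signal if s < 0) / len(signal)


# Hypothetical foreground and background intensities for five spots.
fg = [1200.0, 310.0, 95.0, 4020.0, 150.0]
bg = [100.0, 120.0, 110.0, 130.0, 140.0]

signal = subtract_background(fg, bg)
neg = fraction_negative(signal)
print(signal)
print(neg)
```

The third spot has BG > FG, so its corrected signal is negative and cannot be log-transformed, which is why plain subtraction is not enough.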

Why Might This Happen?
• In 2001 two papers showed that empty spots have less signal than background
• Unbound spots correspond to low-expression genes
• Thus unbound spots are particularly prone to problems
(BG: background intensity; FG: foreground intensity)

So What to Do?
• Heavy-duty mathematical tools employed
• Three major models developed: Edwards log-linear, Smyth normexp, Kooperberg Bayesian
The math is extremely advanced, so we’ll skip that for now. Let’s summarize the methods instead.

Comparison

Method       Accuracy   Speed
Edwards      Good       Fast
NormExp      Better     Slow
Kooperberg   Best       Very Slow

No strong criteria for selecting between these algorithms.
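For a rough feel of the simplest of the three, the sketch below shows a log-linear correction in the spirit of the Edwards approach. Spots well above background are corrected by plain subtraction. Spots near or below background are mapped onto a smooth, always-positive curve. The formula and the fixed threshold `delta` are assumptions made for illustration (published implementations typically choose the threshold from the data), so check the original method before relying on this.

```python
import math


def edwards_correct(fg, bg, delta):
    """Log-linear background correction for one spot (illustrative sketch).

    fg, bg: foreground and background intensities (fg must be > 0).
    delta:  assumed threshold below which subtraction is replaced by a
            smooth positive curve.
    """
    if fg <= 0:
        raise ValueError("foreground intensity must be positive")
    diff = fg - bg
    if diff >= delta:
        # Bright spot: ordinary subtraction.
        return diff
    # Dull spot: decays smoothly towards zero and equals delta at diff == delta.
    return delta * math.exp(1.0 - (bg + delta) / fg)


# Hypothetical intensities, with an assumed threshold of 50.
bright = edwards_correct(1200.0, 100.0, 50.0)
boundary = edwards_correct(150.0, 100.0, 50.0)
dull = edwards_correct(95.0, 110.0, 50.0)
print(bright, boundary, dull)
```

Unlike plain subtraction, the dull spot (BG > FG) comes out small but positive, so it can still be log-transformed downstream.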


Step #3: Spot Quality
Why? Identify artefacts
How? Unknown
Difficulty? +++++
Research? +
