Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 8: Gene Expression Profiling Paul Boutros Bioinformatics for Cancer Genomics May 26-30, 2014
Course Overview 08:30 – 10:45: Expression Profiling in Cancer Genomics; Microarray Pre-Processing Basics. 11:15 – 12:30: Guided Analysis of a Microarray Study
Learning Objectives of Module • Understand the types of microarrays that exist • Identify the sources of noise in a microarray experiment • Appreciate the complete microarray analysis pipeline • Learn to input raw microarray data into R/BioConductor • Learn to pre-process raw microarray data • Perform standard statistical analyses on microarray data
Let’s start off with a question What do expression microarrays actually measure?
Session Overview What are microarrays? What are microarrays used for? Molecular Aspects Biological Aspects Downstream Analyses How is microarray data analyzed? Workflow overview
What is a Microarray? “A DNA microarray is a multiplex technology consisting of thousands of oligonucleotide spots, each containing picomoles of a specific DNA sequence.” Used to quantitate mRNA or DNA Many applications: mRNA or DNA levels SNP identification ChIP-on-Chip
Hypotheses Microarrays are usually hypothesis-generating: they highlight specific genes or features that are particularly interesting for follow-up experiments. There are many interesting exceptions: biomarkers and pathway analyses. This does not reduce the importance of experimental design: the low statistical power of array studies makes good design even more important and very challenging.
Input Samples The nature of the sample is critical: * Unfrozen vs. Frozen vs. FFPE * Total RNA vs. poly-A RNA vs. other subsets
Microarray Basics Imagine a one-spot microarray… Target DNA is labeled, hybridized, and washed. Finally, scan the chip. (Figure labels: Target, Chip, Feature, Probe)
These Are Spotted Arrays Printed onto a series of glass slides by a robot with needle-heads. They produce a characteristic gridding pattern and almost always use two samples simultaneously (two-colour).
Other Types of Arrays Inkjet Arrays Photolithographically generated arrays Bead arrays Protein/cell/lipid-arrays More “niche” applications Not discussed here
InkJet Arrays In 1999, HP spun off its life-science and measurement division into Agilent Technologies. The new company wanted to determine if printer technology could be harnessed to generate microarrays.
Inkjet Array Manufacture Involves Sequential Nucleotide Addition
Photolithographic Arrays Produced using the techniques developed for manufacturing transistors. Mostly pioneered by the company Affymetrix, although other suppliers exist (e.g. Nimblegen). We will be working with Affymetrix data later, so we will walk through the platform in significant detail.
The Glass Matrix Addition of Linker molecule Silanization
Photolithographic Synthesis Photolithographic mask
Final Chip Wafer Feature Chip
Self-Assembling Bead-Arrays Produced by Illumina. 3 μm silicon beads, randomly placed, coated with ~10^5 identical 25 bp probes; probes have identifying barcode (address) sequences. (Figure labels: Labeled cDNA, bead, address, probe)
Comparing Array Platforms
Platform     | Price | Oligos   | Data Quality | Bioinformatics Research
Spotted cDNA | $     | variable | +            | +++
Affymetrix   | $$$   | 25 bp    | +++          | +++
Inkjet       | $$    | ~70 bp   | ++           | ++
Bead Arrays  | $$    | ~25 bp   | ++           | +
I do not endorse specific platforms – they all have their strengths and weaknesses
Session Overview What are microarrays? What are microarrays used for? Molecular Aspects Biological Aspects Downstream Analyses How is microarray data analyzed? Workflow overview
What Are Microarrays Used For? Molecular RNA: • mRNA abundances • Splicing (quantitate different isoforms) • mRNA degradation rates (half-life) • mRNA translation rates • RNA capture (RIP) DNA: • DNA sequence (SNPs) • DNA copy-number • DNA capture (exome, ChIP) • Tag quantitation (genetic screening) Other: • Protein arrays • Cell-based arrays • Lipid arrays
What Are Microarrays Used For? Biological RNA: mRNA abundances; splicing (quantitate different isoforms); mRNA degradation rates (half-life); mRNA translation rates; RNA capture (RIP) * Candidate Gene Identification * Pathway Analysis * Model Characterization * Classifiers/Predictive Models * Drug-Analysis (Dose/Time/Class) * Integration Analysis
Session Overview What are microarrays? What are microarrays used for? Molecular Aspects Biological Aspects Downstream Analyses (upcoming sessions: pathways & clinical integration) How is microarray data analyzed? Workflow overview
[Workflow diagram: Each Spot is a Probe (Cy3, Cy5). A) Remove Noise B) Extract Data. Steps shown: Quantitation, Background, Spot Quality, Intra-Array, Inter-Array, Significance Testing, Clustering, Integration, Spot List]
Step #1: Image Quantitation Why? Quantitative vs. Qualitative How? Image Segmentation Difficulty? +++ Research? +
Image Segmentation 101: Find Grids 1. Find Grids 2. Find Spots 3. Spot Outline
Image Segmentation 101: Find Spots Key Step: Integrate Signal Across Array
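To make the "integrate signal" step concrete, here is a minimal Python sketch. It is not the method of any particular scanner software. It assumes the grid has already been found, that the spot centre is known, and that a spot is a circle of fixed radius. The function name `quantify_spot`, the toy image, and the use of medians as the summary are all illustrative assumptions.

```python
from statistics import median

def quantify_spot(image, cx, cy, radius, ring=2):
    """Summarise one circular spot (hypothetical helper).

    image  : list of rows of pixel intensities
    cx, cy : spot centre in pixel coordinates (assumed already known)
    radius : spot radius in pixels
    ring   : width of the surrounding annulus used as local background
    Returns (foreground_median, background_median).
    """
    fg, bg = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= radius ** 2:
                fg.append(value)          # pixel lies inside the spot
            elif d2 <= (radius + ring) ** 2:
                bg.append(value)          # pixel lies in the background ring
    return median(fg), median(bg)

# Toy 7x7 image (made-up values): a bright spot on a dim background.
toy = [[100] * 7 for _ in range(7)]
for y in range(7):
    for x in range(7):
        if (x - 3) ** 2 + (y - 3) ** 2 <= 1:
            toy[y][x] = 900

fg, bg = quantify_spot(toy, cx=3, cy=3, radius=1)
```

Real images bring the problems listed on the next slide (stray signal, missing spots, deformities), which a fixed-circle rule like this one cannot handle.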
Image Segmentation 101: Challenges Problems: Stray Signal Missing Spots Gross Deformities Manual Validation
Research? Surprisingly, not much investigation. This is probably a source of error in all studies. Manual checking of spot-detection remains the norm. Problematic as studies & arrays get larger.
[Workflow diagram, repeated. Steps shown: Quantitation, Background, Spot Quality, Intra-Array, Inter-Array, Significance Testing, Clustering, Integration, Spot List]
Step #2: Background Correction Why? Remove Stray Signal How? Model-based Difficulty? ++++ Research? ++
Spot Segmentation Signal ??? Background
So what do we get? Foreground Intensity: FG. Background Intensity: BG. Isn’t it simple? Signal = FG - BG. NO! If BG > FG, then the signal is negative (-ve). This affects 0.1-2% of spots.
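The naive subtraction is easy to sketch in Python. The intensities below are made up for illustration, and the helper name `subtract_background` is hypothetical. The only point is to show how a spot with BG > FG ends up with a negative signal, which then cannot be log-transformed.

```python
def subtract_background(fg, bg):
    """Naive per-spot correction: Signal = FG - BG."""
    return [f - b for f, b in zip(fg, bg)]

# Hypothetical intensities for five spots (not real data).
fg = [1200, 540, 95, 3000, 110]
bg = [100, 120, 105, 90, 100]

signal = subtract_background(fg, bg)

# Spots where the background exceeds the foreground.
negative = [s for s in signal if s < 0]
fraction_negative = len(negative) / len(signal)
```

With only five spots the negative fraction here is far higher than the 0.1-2% quoted above; the toy data are chosen only so that the failure case is visible.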
Why Might This Happen? In 2001, two papers showed that empty spots have less signal than background. Unbound spots correspond to low-expression genes. Thus unbound spots are particularly prone to problems. (Figure labels: Background Intensity: BG; Foreground Intensity: FG)
So What to Do? Heavy-duty mathematical tools are employed. Three major models have been developed: Edwards (log-linear), Smyth (normexp), Kooperberg (Bayesian). The math is extremely advanced, so we’ll skip that for now. Let’s summarize the methods instead.
Comparison
Method     | Accuracy | Speed
Edwards    | Good     | Fast
NormExp    | Better   | Slow
Kooperberg | Best     | Very Slow
No strong criteria for selecting between these algorithms.
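For a feel of the fastest of the three, here is a rough Python sketch of an Edwards-style log-linear correction. It follows one common statement of the rule: plain subtraction when FG - BG is at least a threshold, and a smooth exponential interpolation below it, so the corrected signal stays positive. The function name `edwards_style_correct` and the fixed `delta=50.0` are illustrative assumptions; in practice the threshold is estimated from the data, and the exact formula should be checked against the original publication.

```python
import math

def edwards_style_correct(fg, bg, delta=50.0):
    """Edwards-style background correction for one spot (sketch).

    fg, bg : foreground and background intensities (fg must be > 0)
    delta  : threshold below which subtraction is replaced by a
             log-linear interpolation (fixed here as an assumption)
    """
    diff = fg - bg
    if diff >= delta:
        # Well above background: ordinary subtraction.
        return diff
    # Near or below background: decays smoothly towards zero,
    # never negative, and equals delta exactly when diff == delta.
    return delta * math.exp(1.0 - (bg + delta) / fg)

bright = edwards_style_correct(1000.0, 100.0)   # plain subtraction applies
boundary = edwards_style_correct(150.0, 100.0)  # exactly at the threshold
dim = edwards_style_correct(90.0, 100.0)        # BG > FG, still positive
```

The normexp and Bayesian approaches instead fit a statistical model to the intensities, which is why they are slower.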
[Workflow diagram, repeated. Steps shown: Quantitation, Background, Spot Quality, Intra-Array, Inter-Array, Significance Testing, Clustering, Integration, Spot List]
Step #3: Spot Quality Why? Identify artefacts How? Unknown Difficulty? +++++ Research? +