Gene expression data: Questions, answers and statistics
Terry Speed and Yee Hwa Yang
Department of Statistics, UC Berkeley; Genetics and Bioinformatics, Walter & Eliza Hall Institute of Medical Research
Overview
• Questions involving microarray data.
• Different experimental designs.
• Case studies, including
  • Olfactory epithelium,
  • Olfactory bulb,
  • Identification of differentially expressed genes,
  • Pattern searching.
Questions and answers: a point of view
Biological questions come first, then statistical methods (design, analysis) and thinking, leading to tentative answers together with an assessment of the uncertainty in those answers.
This is in contrast to beginning with purely exploratory analyses, or with modelling either processes or data.
Something of each of the last two comes into most statistical analyses, but only after focussing on the biological questions.
Workflow:
• Biological question (differentially expressed genes, sample class prediction, etc.)
• Experimental design
• Microarray experiment
• 16-bit TIFF files
• Image analysis: (Rfg, Rbg), (Gfg, Gbg)
• Normalization: R, G
• Estimation, testing, clustering, discrimination
• Biological verification and interpretation
Which genes are (relatively) up/down regulated?
Samples: liver tissue from each of two kinds of mice, e.g. KO vs. WT, or mutant vs. WT.
[Design diagram: T vs. C, labelled n.]
• For each gene form the t statistic:
$$ t = \frac{\text{average of the } n \text{ trt } M\text{s}}{\sqrt{\tfrac{1}{n}\,(\text{SD of the } n \text{ trt } M\text{s})^2}} $$
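Which genes are (relatively) up/down regulated?
Samples: as before, but also pooled control liver tissue (C*).
[Design diagram: T vs. C* and C vs. C*, each labelled n.]
• For each gene form the t statistic:
$$ t = \frac{\text{average of the } n \text{ trt } M\text{s} \;-\; \text{average of the } n \text{ ctl } M\text{s}}{\sqrt{\tfrac{1}{n}\left[(\text{SD of the } n \text{ trt } M\text{s})^2 + (\text{SD of the } n \text{ ctl } M\text{s})^2\right]}} $$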
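The two per-gene t statistics described on these slides can be sketched as follows. This is a minimal illustration, not the authors' software: the function names are hypothetical, the SD is taken to be the usual sample SD (n - 1 denominator), and the factor 1/n is assumed to apply to both variance terms in the two-group case.

```python
import math
import statistics


def one_sample_t(trt_ms):
    """t statistic for the n log-ratios M of one gene (direct T vs. C design)."""
    n = len(trt_ms)
    mean = statistics.fmean(trt_ms)
    sd = statistics.stdev(trt_ms)  # sample SD, n - 1 denominator (assumption)
    return mean / math.sqrt(sd ** 2 / n)


def two_sample_t(trt_ms, ctl_ms):
    """t statistic when T and C are each measured against a pooled reference C*.

    Assumes the same number n of slides in both groups, as on the slide.
    """
    n = len(trt_ms)
    if len(ctl_ms) != n:
        raise ValueError("this sketch assumes equal group sizes")
    diff = statistics.fmean(trt_ms) - statistics.fmean(ctl_ms)
    var_sum = statistics.stdev(trt_ms) ** 2 + statistics.stdev(ctl_ms) ** 2
    return diff / math.sqrt(var_sum / n)


# Hypothetical log-ratios for a single gene, for illustration only.
print(one_sample_t([1.0, 2.0, 3.0]))
print(two_sample_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))
```

In practice the same calculation would be repeated for every gene on the array.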
Multiple comparisons of interest
Samples: liver tissue from mice treated with cholesterol-modifying drugs.
Question 1: Find genes that respond differently between the treatment and the control.
Question 2: Find genes that respond similarly across two or more treatments relative to control.
[Design diagram: each of T1, T2, T3 and T4 compared with C, x 2 each.]
Interaction?
Samples: treated cell lines at 4 time points (30 minutes, 1 hour, 4 hours, 24 hours).
Question: Which genes contribute to the enhanced inhibitory effect of OSM when it is combined with EGF? Role of time?
[Design diagram: ctl, OSM, EGF, and OSM & EGF, at 4 times.]
Gene Expression Data
Gene expression data on genes 1, 2, 3, 4, 5, ... for 5 slides (experiments):

Gene   slide1   slide2   slide3   slide4   slide5
1        0.46     0.30     0.80     1.51     0.90
2       -0.10     0.49     0.24     0.06     0.46
3        0.15     0.74     0.04     0.10     0.20
4       -0.45    -1.03    -0.79    -0.56    -0.32
5       -0.06     1.06     1.35     1.09    -1.09

Gene expression level of gene i on slide j = log2(Red intensity / Green intensity).
Sometimes there is a common reference, e.g. green, sometimes not.
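As a small illustration of the log-ratio defined above, the sketch below computes M = log2(R / G) for one spot, together with the average log intensity A = log2 √(R·G) that is used later in the talk as a plotting axis. The function names and the intensity values are hypothetical, and taking A to base 2 is an assumption.

```python
import math


def log_ratio(red, green):
    """M: log2 ratio of red to green intensity for one spot."""
    return math.log2(red / green)


def average_log_intensity(red, green):
    """A: log2 of the geometric mean of the two intensities (base 2 assumed)."""
    return math.log2(math.sqrt(red * green))


# Hypothetical intensities, for illustration only.
red, green = 400.0, 100.0
print(log_ratio(red, green))              # red is 4-fold higher, so M = 2
print(average_log_intensity(red, green))
```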
Olfactory epithelium
GOAL: Exploratory study to identify genes with altered expression between zone 1 and zone 4 of the olfactory epithelium for newborn (P0) and adult (A) mice.
Tissue samples:
• P01: Zone 1 of epithelium from P0 mouse.
• P04: Zone 4 of epithelium from P0 mouse.
• A1: Zone 1 of epithelium from adult mouse.
• A4: Zone 4 of epithelium from adult mouse.
Probes: ~19,000 mouse cDNAs.
Factorial design: as completed
[Design diagram: samples P01, A1, P04 and A4 connected by five slides numbered 1 to 5, with the age effect and the zone effect marked.]
Layout of the cDNA microarrays
• Made in Ngai lab, UC Berkeley.
• Mouse ESTs, 19,200 spots.
• Two different print groups (pg1, pg2), each with
  • 4 x 4 grids, each with
  • 25 x 24 spots.
• Controls on the first 2 rows of each grid.
Two slides
[Images: P04 vs. P01 (pg2) and A1 vs. P01 (pg2).]
Preprocessing: image analysis
1. Addressing: locate centers.
2. Segmentation: classification of pixels as either signal or background (using seeded region growing, SRG).
3. Information extraction: for each spot of the array, calculate signal intensity pairs, background and quality measures.
[Figure: SRG results for P04 vs. P01.]
Preprocessing: after image analysis
Where necessary, we carry out:
• Colour normalization (location and scale) within slides, possibly within pin-groups,
• Scale normalization between slides,
• A variety of other adjustments, e.g. to remove spatial artifacts.
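One simple way to picture the location and scale adjustments listed above is sketched below. It is only an illustration under stated assumptions, not the method used in the talk: location is adjusted by subtracting each pin group's median M, and slides are brought to a common scale by equalizing their median absolute deviation (MAD). The function names and toy values are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def normalize_location(ms, pin_groups):
    """Subtract each pin group's median M from the spots in that group."""
    by_group = defaultdict(list)
    for m, g in zip(ms, pin_groups):
        by_group[g].append(m)
    medians = {g: statistics.median(v) for g, v in by_group.items()}
    return [m - medians[g] for m, g in zip(ms, pin_groups)]


def normalize_scale(slides):
    """Rescale each slide's M values so that all slides share the same MAD.

    The common target is the geometric mean of the slide MADs; every MAD is
    assumed to be non-zero.
    """
    mads = [mad(s) for s in slides]
    target = math.exp(statistics.fmean([math.log(d) for d in mads]))
    return [[m * target / d for m in s] for s, d in zip(slides, mads)]


# Toy log-ratios from two pin groups on one slide, for illustration only.
centered = normalize_location([1.0, 2.0, 3.0, 10.0, 12.0, 14.0], [1, 1, 1, 2, 2, 2])
print(centered)
print(normalize_scale([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0]]))
```

Median and MAD are used here because they are robust to the minority of genes that are truly differentially expressed; other location and scale estimates could be substituted.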
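Factorial design
Parameterization: P01 = m, A1 = m + a, P04 = m + z, A4 = m + z + a + za.
Different ways of estimating parameters, e.g. the zone effect z:
• Slide 1 = (m + z) - (m) = z
• Slide 2 - slide 5 = ((m + a) - (m)) - ((m + a) - (m + z)) = (a) - (a - z) = z
• Slide 4 + slide 3 - slide 5 = ... = z
[Design diagram: P01, A1, P04 and A4 connected by slides 1 to 5.]
How do we combine the information?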
Regression analysis
Define a design matrix X so that E(M) = Xβ, where β = (z, a, za) holds the zone, age and interaction effects.
Use the least squares estimates of z, a and za for each gene.
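A hedged sketch of this regression for a single gene follows. The design matrix shown on the original slide is not reproduced in this text, so the one below is an assumption: rows 1, 2 and 5 follow the expressions given for those slides, while rows 3 and 4 (A4 vs. P01 and A1 vs. A4) are inferred from the identity "4 + 3 - 5 = z" and may not match the actual hybridizations. The helper names are hypothetical, and the normal equations are solved with a small pure-Python routine rather than a statistics package.

```python
# Columns: z (zone), a (age), za (interaction).
X = [
    [1, 0, 0],    # slide 1: P04 - P01 = z
    [0, 1, 0],    # slide 2: A1 - P01 = a
    [1, 1, 1],    # slide 3: A4 - P01 = z + a + za   (assumed)
    [-1, 0, -1],  # slide 4: A1 - A4 = -(z + za)     (assumed)
    [-1, 1, 0],   # slide 5: A1 - P04 = a - z
]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(x, m):
    """Least squares estimate of (z, a, za) from the log-ratios m of one gene."""
    p = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xtm = [sum(row[i] * v for row, v in zip(x, m)) for i in range(p)]
    return solve(xtx, xtm)


# Hypothetical log-ratios for one gene on slides 1 to 5, for illustration only.
print(least_squares(X, [0.5, -1.0, -0.25, -0.75, -1.5]))
```

Least squares combines the direct and indirect estimates of each effect into a single estimate per gene, which is one answer to the question of how to combine the information.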
Estimates of zone effects
[Plot: log(zone 4 / zone 1) vs. average A, where A = average log √(R·G); gene A and gene B are marked.]
Estimates of zone effects vs. SE
• t = (estimated Z effect) / SE
[Plots: Z effect and t, each against log2(SE).]
Estimates of age effects vs. estimates of zone effects
[Plots with axes Zone and Age.]
Top 50 genes from each effect
[Venn diagram of the top 50 genes for the Zone, Age and Zone.Age interaction effects; region counts as extracted: 19, 0, 48, 29, 2, 0, 19.]
In situ hybridization image Gene A (up-regulated in zone 4)
1-year-old statement by our collaborator
• Comparison of large regions of olfactory bulb fails to yield molecular differences.
• Molecules involved in target recognition may be expressed in a limited subset of cells.
A new approach is required that possesses high sensitivity and throughput of analysis.
The olfactory bulb experiments
Samples: tissues from different regions of the olfactory bulb.
Question 1: Find differences between different regions.
Question 2: Identify genes with pre-specified patterns across regions.
Note: novel design (controversial?)
[Design diagram: regions M, A, V, D, P and L.]
Regression analysis
Define a design matrix X so that E(M) = Xβ, where β holds the region effects relative to L.
Use the least squares estimates for A-L, P-L, D-L, V-L, M-L.
Contrasts
-- We can estimate all 15 different comparisons (the 6 regions taken in pairs) directly and/or indirectly, e.g. D - M = (D - L) - (M - L).
-- For every gene we have a pattern based on the 15 different comparisons, e.g. Gene #5699.
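The step from the five estimated effects relative to L to the full pattern of 15 pairwise comparisons can be sketched as below. The function name and the numerical estimates are hypothetical; only the region labels and the identity D - M = (D - L) - (M - L) come from the slides.

```python
from itertools import combinations

REGIONS = ["A", "P", "D", "V", "M", "L"]


def all_contrasts(vs_l):
    """Build every pairwise contrast from estimates relative to region L.

    vs_l maps each of A, P, D, V, M to its estimated effect minus L.
    """
    effect = dict(vs_l, L=0.0)  # L is the reference, so L - L = 0
    return {
        f"{r1}-{r2}": effect[r1] - effect[r2]
        for r1, r2 in combinations(REGIONS, 2)
    }


# Hypothetical estimates for one gene, for illustration only.
estimates = {"A": 1.0, "P": 0.5, "D": 2.0, "V": -1.0, "M": 0.25}
pattern = all_contrasts(estimates)
print(len(pattern))      # number of pairwise comparisons among 6 regions
print(pattern["D-M"])    # (D - L) - (M - L)
```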
Genes that share the same pattern
Find the genes with the smallest Euclidean distance to gene #5699 (whatever it is: another story).
The second gene is a replicate of the first.
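The pattern search described above can be sketched as a nearest-neighbour lookup. This is an illustration only: the function name, the gene identifiers and the short toy patterns are hypothetical, whereas on the slides each pattern would be a gene's vector of 15 comparisons.

```python
import math


def nearest_genes(patterns, target_id, k=5):
    """Return the k gene ids whose patterns are closest to the target's pattern.

    patterns maps a gene id to its vector of estimated comparisons; closeness
    is Euclidean distance.
    """
    target = patterns[target_id]
    distances = [
        (math.dist(pattern, target), gene_id)
        for gene_id, pattern in patterns.items()
        if gene_id != target_id
    ]
    return [gene_id for _, gene_id in sorted(distances)[:k]]


# Toy two-dimensional patterns, for illustration only.
toy = {
    "g1": [0.0, 0.0],
    "g2": [0.1, 0.0],
    "g3": [3.0, 4.0],
    "g4": [1.0, 1.0],
}
print(nearest_genes(toy, "g1", k=2))
```

A replicate spot of the target gene should appear near the top of such a list, which gives a quick sanity check on the distance measure.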
How the question got refined
After designing and carrying out the experiment, the initial analysis, and the follow-up in situ hybridizations to confirm our findings, we realized we had failed to perceive the most interesting question, which was:
Find genes whose expression patterns show (spatial) restriction across the bulb, i.e. not just gradients (differential expression), but localization.
Acknowledgments
Statistical collaborators: Yee Hwa Yang (Berkeley), Sandrine Dudoit (Stanford), Ingrid Lönnstedt (Uppsala), Natalie Thorne (WEHI)
CSIRO Image Analysis Group: Michael Buckley, Ryan Lagerstorm
Ngai Lab (Berkeley): Cynthia Duggan, Jonathan Scolnick, Dave Lin, Vivian Peng, Percy Luu, Elva Diaz, John Ngai
LBNL: Matt Callow
Some web sites:
Technical reports, talks, software etc.: http://www.stat.berkeley.edu/users/terry/zarray/Html/
Statistical software R ("GNU's S"): http://lib.stat.cmu.edu/R/CRAN/
Packages within the R environment:
-- Spot: http://www.cmis.csiro.au/iap/spot.htm
-- SMA (statistics for microarray analysis): http://www.stat.berkeley.edu/users/terry/zarray/Html