Case Study I: Two-Sample Analysis Ru-Fang Yeh October 23, 2004 Genentech Hall Auditorium, Mission Bay, UCSF
[Workflow diagram]
• Biological question
• Experimental design
• Microarray experiment
• Quality measurement (Failed / Pass)
• Image analysis
• Preprocessing
• Normalization
• Gene-by-sample matrix:

  Gene   Sample/Condition
         1      2      3      4     …
  1      0.46   0.30   0.80   1.51  …
  2     -0.10   0.49   0.24   0.06  …
  3      0.15   0.74   0.04   0.10  …
  :      …

• Analysis: estimation, testing, annotation, clustering, discrimination
• Biological verification and interpretation
Microarrays: Case Studies and Advanced Analysis
Image analysis, quality assessment and pre-processing
• Short-oligonucleotide chip data (CEL, CDF files):
  - quality assessment
  - background correction
  - probe-level normalization
  - probe set summary
• Two-color spotted array data (gpr, gal files):
  - quality assessment; diagnostic plots
  - background correction
  - array normalization
• Array CGH data (UCSF spot file):
  - quality assessment; diagnostic plots
  - background correction
  - clones summary
  - array normalization
• Result: probes by sample matrix of log-ratios or log-intensities
• Analysis of expression data:
  - identify D.E. genes, estimation and testing
  - clustering
  - discrimination, etc.
Biological Question: Molecular Phenotypic Difference in Rat Alveolar Type I and Type II Cells From “Freshly-isolated Rat Alveolar Type I Cells, Type II Cells, and Cultured Type II Cells Have Distinct Molecular Phenotypes.” (To appear, A J Phys) By Robert Gonzalez, Yee Hwa Yang, Chandi Griffin, Lennell Allen, Zachary Tigue, and Leland Dobbs.
Pulmonary Alveolar Epithelium
[Figure: Type I cells and Type II cells]
Alveolar Epithelial Type I and Type II Cells

                                 Type I Cells    Type II Cells
  % Lung cells                   ~8%             ~15%
  % Lung internal surface area   ~98%            ~2%
  Volume / cell                  ~2000 µm³       ~400 µm³
  Surface area / cell            ~5300 µm²       ~100 µm²
  (Stone, AJRCMB 1992)

• Morphologic characteristics conserved across the entire range of mammals.
• Known/possible functions, Type I cells:
  - water and ion transport
  - host defense (oxidants & microorganisms)
  - tumor suppression
  - matrix preservation
• Known/possible functions, Type II cells:
  - surfactant metabolism
  - ion transport
  - produce immune effector molecules
  - progenitor cells for TI cells after oxidant injury (and in lung development)
Alveolar Epithelial Cell Lineage Following Lung Injury
[Diagram: Type II cell, proliferation, transdifferentiation, Type I cell (Evans, 1975; Adamson, 1975)]
• Transdifferentiation: the process by which one “stable” (differentiated) cellular phenotype changes into a different “stable” cellular phenotype.
Study Objectives
Long-term goals: increase understanding of
• alveolar epithelial cell lineages
• the mechanisms that regulate alveolar epithelial development and differentiation
Use microarrays to establish molecular profiles of TI and TII cells:
• Identify differences in expression of single genes, to
  - provide additional marker genes
  - develop new hypotheses about cellular functions of each cell type
• Determine changing patterns of expression of groups of genes, to
  - understand processes of (trans)-differentiation in vivo and in vitro
  - identify candidate factors (transcription cascades) important in regulating differentiation
Gene Expression Experiment
[Diagram: TI cells, TII cells, cultured TII cells]
Freshly Isolated TI and TII Cells
[Micrographs: TII cells panel (showing a TII cell and a TI cell fragment); TI cells panel]
Type II Cells in vitro
Two sets of culture conditions:
• First set:
  - Matrix (TCP, fibronectin)
  - Apical surface covered by liquid
  - Mechanical distention
• Second set:
  - Matrix (EHS, contracted collagen gels)
  - Soluble factors (e.g. KGF)
  - Apical surface exposed to air
  - Mechanical contraction
Experimental design
• Probe: Affymetrix Rat U34 chip A, with 8799 probe sets.
• Target: 4 biological replicates of each cell type:
  - TID0: freshly isolated TI cells
  - TIID0: freshly isolated TII cells
  - TIID7: TII cells cultured for 7 days [traditionally used as a model for TI day 0 cells]
• Cell purity criterion: < 2% cross-contamination
Preparing mRNA samples:
Dissection of tissue → RNA isolation → Amplification → Probe labelling → Hybridization
Preparing mRNA samples (biological replicate):
Dissection of tissue [biological replicate] → RNA isolation → Amplification → Probe labelling → Hybridization
Preparing mRNA samples (technical replicate):
Dissection of tissue → RNA isolation [technical replicate] → Amplification → Probe labelling → Hybridization
Analysis Aims
Main questions: establish molecular profiles of TI and TII cells:
• Identify differences in expression of single genes, to
  - provide additional marker genes
  - develop new hypotheses about cellular functions of each cell type
• Understand the process of (trans)-differentiation in vivo and in vitro
• Identify candidate factors (transcription cascades) important in regulating differentiation
Approaches:
• Identify differentially expressed (DE) genes between TID0 and TIID0.
• Compare TID0 and TIID7: are they similar?
• Find common regulatory elements (transcription factor binding sites) in groups of candidate co-regulated genes.
Preprocessing
• Quality assessment
• Background subtraction
• Normalization
• Summarization of probe set values
High Density Oligonucleotide Arrays (Affymetrix)
[Figure: GeneChip probe array, hybridized probe cell, and image of hybridized probe array]
• GeneChip probe array: 1.28 cm, with ~500,000 probe cells on each chip
• Probe cell: 24 µm, with millions of copies of a specific oligonucleotide probe per probe cell
• Hybridized probe cell: single stranded, labeled RNA target bound to the oligonucleotide probe
How Affymetrix Arrays Are Made
[Figure from Lipshutz et al. Nat. Gen. 1999]
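For one gene (probe set): 16 probes/gene for Rat U34
• mRNA reference sequence (ends labelled 3’ and 5’ in the figure):
  …TCGTCTGTATCACAGACACAAAGTTGACTG…
• PM (perfect match): CAGACATAGTGTCTGTGTTTCAACT
• MM (mismatch):      CAGACATAGTGTGTGTGTTTCAACT
• The MM probe differs from the PM probe only at the central (13th) base of the 25-mer.
[Figure: fluorescent probe intensity for PM and MM probes]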
From scanned image to expression values
• Hybridization + scanning: DAT file
• Image analysis: CEL file (+ CDF file)
• Preprocessing (dChip, MAS, GCOS, RMA, GCRMA):
  0. Quality assessment
  1. Background subtraction (B)
  2. Normalization (N)
  3. Summarization of probe set values (S)
• Output:
  - Text file: probe ID + log2(intensity)
  - CHP file: intensity value, Absent / Present call
  - Excel file
  - Report file, quality
Quantile Normalization (Bolstad et al. 2003)
• Quantile normalization is a method to make the distribution of probe intensities the same for every chip.
• The normalization distribution is chosen by averaging each quantile across chips.
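The two bullets above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the Bioconductor implementation: the function name `quantile_normalize` and the toy intensities are assumptions, and ties are broken by position instead of being averaged.

```python
from statistics import mean

def quantile_normalize(chips):
    """Quantile-normalize a list of chips (equal-length lists of probe intensities)."""
    n_probes = len(chips[0])
    # Sort each chip, then average each quantile (rank position) across chips.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [mean(s[k] for s in sorted_chips) for k in range(n_probes)]
    normalized = []
    for chip in chips:
        # Rank probes within the chip (ties broken by position, an assumption).
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Replace each intensity by the reference value of the same rank.
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities: 3 chips, 4 probes each.
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.5, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After normalization every chip holds the same set of values, only assigned to different probes according to each chip's own ranking.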
Probe Set Summarization: Robust Multi-array Average (Irizarry et al. 2003)
• The RMA model assumes that each probe-cell value decomposes as
  $\log_2(\text{normalized}(\text{observed PM intensity} - \text{background})) = \text{chip effect} + \text{probe-specific effect} + \text{error}$
• The expression level is estimated by fitting this linear model with a robust procedure (such as median polish or IRLS).
• RMA value: the estimated chip effect, i.e. the log2 expression for chip i.
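As an illustration of the robust fit, here is a sketch of Tukey's median polish for one probe set. It assumes the input is already background-corrected, normalized and on the log2 scale; the function name `median_polish`, the fixed iteration count and the toy matrix are assumptions, not the RMA code itself.

```python
from statistics import median

def median_polish(matrix, n_iter=10):
    """Fit value = overall + probe effect + chip effect + residual by median polish.

    matrix[j][i] is the log2 value for probe j on chip i.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows  # probe-specific effects
    col_eff = [0.0] * n_cols  # chip effects
    for _ in range(n_iter):
        # Sweep row (probe) medians out of the residuals.
        for j in range(n_rows):
            m = median(resid[j])
            row_eff[j] += m
            resid[j] = [v - m for v in resid[j]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column (chip) medians out of the residuals.
        for i in range(n_cols):
            m = median(resid[j][i] for j in range(n_rows))
            col_eff[i] += m
            for j in range(n_rows):
                resid[j][i] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

# Hypothetical probe set: 3 probes (rows) on 3 chips (columns), log2 scale.
matrix = [[7.0, 7.5, 9.0],
          [8.0, 8.5, 10.0],
          [9.0, 9.5, 11.0]]
overall, probe_effects, chip_effects, residuals = median_polish(matrix)
# Summary value per chip: overall level plus the chip effect.
expression = [overall + c for c in chip_effects]
```

Because medians are used in both sweeps, a single aberrant probe has little influence on the per-chip summary, which is the point of using a robust procedure here.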
Summarization Method Comparison: AffyComp
http://affycomp.biostat.jhsph.edu/
[Figure panels: median SD across replicates; average false positives if we use fold-change > 2 as a cut-off]
Software
• Affymetrix: MAS v5.1 or GCOS v1.0
• RMA (Robust Multi-array Average) / GCRMA / PLM:
  - Bioconductor: http://www.bioconductor.org
  - affylmGUI: http://bioinf.wehi.edu.au/affylmGUI/
  - RMAExpress: http://stat.www.berkeley.edu/~bolstad/RMAExpress/RMAExpress.html
  - Axon: Acuity (RMA only, commercial)
  - GeneTraffic (RMA only, commercial)
• Li & Wong’s MBEI (Multiplicative Model-Based Expression Index):
  - dChip: http://www.dchip.org/
Qualitative Quality Assessment Using PLM
[Chip images: residuals and weights]
More QC examples: http://stat-www.berkeley.edu/users/bolstad/PLMImageGallery/index.html
QC with affyPLM
QC with boxplots
RMAExpress
affylmGUI
Analysis
• Identify differentially expressed (DE) genes between TID0 and TIID0.
• Compare TID0 and TIID7.
• Beyond expression.
DE by Average Fold-Change (M): Freshly Isolated TI vs TII Cells
• ~8800 probe sets
• 50 + 131 probe sets with > 4x change (|M| > 2)
• 163 + 401 probe sets with > 2x change (|M| > 1)
[MA plot: M against A, with 2x and 4x fold-change lines in the TI and TII directions]
• Simple fold-change rules give no assessment of statistical significance.
• Need to construct test statistics incorporating variability estimates (from replicates).
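The fold-change rule itself is easy to sketch. The snippet below assumes M is the difference of mean log2 expression between the two cell types; the gene names, values and helper names are hypothetical.

```python
from statistics import mean

def average_m(ti, tii):
    """Average log2 fold change M between two groups of log2 expression values."""
    return mean(ti) - mean(tii)

def fold_change_call(m):
    """Classify |M| against the 2x (|M| > 1) and 4x (|M| > 2) cut-offs."""
    if abs(m) > 2:
        return ">4x"
    if abs(m) > 1:
        return ">2x"
    return "<=2x"

# Hypothetical log2 expression values, 4 replicates per cell type.
genes = {
    "geneA": ([10.0, 10.0, 10.0, 10.0], [7.5, 7.5, 7.5, 7.5]),
    "geneB": ([9.5, 6.5, 8.5, 7.5], [6.5, 6.5, 6.5, 6.5]),  # noisy replicates
    "geneC": ([9.0, 9.0, 9.0, 9.0], [9.5, 9.5, 9.5, 9.5]),
}
calls = {g: fold_change_call(average_m(ti, tii)) for g, (ti, tii) in genes.items()}
```

Note that `geneB` passes the 2x cut-off even though its TI replicates range from 6.5 to 9.5, which is exactly the weakness of a rule that ignores variability.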
Two-sample t-statistic & p-value
• The two-sample t-statistic is used to test equality of the group means $\mu_1$, $\mu_2$.
• The p-value p* is the probability that, under the null hypothesis (H0: $\mu_1 = \mu_2$), the test statistic is at least as extreme as the observed value t*.
[Figure: null distribution of the test statistic with tail areas p*/2 beyond -t* and t*]
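For reference, one standard form of the statistic is written out below. The transcript does not preserve the formula, so this assumes the pooled (equal-variance) version; the symbols $n_k$, $\bar{x}_k$ and $s_k^2$ for group sizes, sample means and sample variances are notation introduced here.

```latex
% Pooled (equal-variance) two-sample t-statistic.
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}

% Two-sided p-value for an observed value t^*, with T \sim t_{n_1 + n_2 - 2} under H_0.
p^* = \Pr\left( |T| \ge |t^*| \,\middle|\, H_0 \right)
```

With 4 replicates per cell type, as in this study, the reference distribution has $n_1 + n_2 - 2 = 6$ degrees of freedom.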
More Two-Sample Statistics
Perform statistical tests on normalized, log-transformed data:
• Standard t-test: assumes normally distributed data in each class (!), equal variances within classes
• Welch t-test: as above, but allows unequal variances
• Wilcoxon test: non-parametric, rank-based
• Permutation test: estimate the distribution of the test statistic under the null hypothesis by permutations of the sample labels
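A permutation test for a single gene can be sketched with the standard library alone. This version enumerates every relabelling exactly, which is feasible for 4 vs 4 samples, and uses the Welch statistic; the function names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch two-sample t-statistic (unequal variances allowed)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_p_value(x, y, stat=welch_t):
    """Two-sided p-value from all relabellings of the pooled samples."""
    pooled = x + y
    observed = abs(stat(x, y))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that exact ties count as "at least as extreme".
        if abs(stat(gx, gy)) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical log2 expression for one gene, 4 replicates per cell type.
ti = [9.8, 10.1, 10.4, 9.9]
tii = [8.0, 8.3, 7.9, 8.2]
p = permutation_p_value(ti, tii)
```

With 4 replicates per group there are only 70 distinct relabellings, so even perfectly separated groups like these cannot give a two-sided permutation p-value below 2/70 ≈ 0.029.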
When there are few replicates…
• (Fold-change) Averages can be driven by outliers.
• T-statistics can be driven by tiny variances.
Solution: “robust” version of the t-statistic
• Replace mean by median
• Replace standard deviation by median absolute deviation
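One way to write such a robust statistic is sketched below. The exact form is an assumption (the slide only names the two substitutions); the 1.4826 factor is the usual scaling that makes the MAD comparable to a standard deviation for normal data, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, median, stdev

def mad(values, scale=1.4826):
    """Median absolute deviation, scaled to match the sd under normality."""
    center = median(values)
    return scale * median(abs(v - center) for v in values)

def welch_t(x, y):
    """Ordinary two-sample statistic based on means and standard deviations."""
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se

def robust_t(x, y):
    """Same shape as welch_t, with median for mean and MAD for sd.

    Note: the MAD can be zero when most values are tied, which would
    make this ratio undefined.
    """
    se = sqrt(mad(x) ** 2 / len(x) + mad(y) ** 2 / len(y))
    return (median(x) - median(y)) / se

# Hypothetical log2 values; the last TI replicate is an outlier.
x = [8.0, 8.2, 7.9, 12.0]
y = [8.1, 7.9, 8.0, 8.2]
mean_difference = mean(x) - mean(y)        # pulled up by the outlier
median_difference = median(x) - median(y)  # barely affected
```

The single outlier moves the mean difference to almost one log2 unit while the median difference stays close to zero.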
Alternative Test Statistics
1. Penalized-t
• Tries to find a compromise between solely using t and solely using the mean.
• There are several similar solutions of the form $t_{\text{pen}} = \bar{M} / (a + s)$, where s = standard deviation.
• Question: how to estimate a?
  - 90th percentile of the standard deviations (s values): Efron et al (2000).
  - The value that minimizes the coefficient of variation (cv) of the absolute t-values (SAM): Tusher et al (2001).
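A rough sketch of the percentile-based choice of a follows. It assumes the penalized statistic has the form mean / (a + s) and takes a as the 90th percentile of the per-gene standard deviations; the function names and the four-gene data set are hypothetical, and a real analysis would estimate a from thousands of genes.

```python
from math import sqrt
from statistics import mean, quantiles, stdev

def ordinary_t(m_values):
    """One-sample t-statistic for replicate log-ratios M of one gene."""
    return mean(m_values) / (stdev(m_values) / sqrt(len(m_values)))

def penalized_t(m_by_gene):
    """Penalized statistic mean / (a + s), with a = 90th percentile of s."""
    s = {g: stdev(v) for g, v in m_by_gene.items()}
    a = quantiles(s.values(), n=10)[-1]  # last decile cut point
    return {g: mean(v) / (a + s[g]) for g, v in m_by_gene.items()}, a

# Hypothetical replicate log-ratios for four genes.
m_by_gene = {
    "tiny_variance": [0.10, 0.11, 0.10, 0.11],  # small effect, tiny sd
    "large_effect": [2.0, 2.4, 1.8, 2.2],
    "noise_1": [0.5, -0.4, 0.3, -0.6],
    "noise_2": [0.1, -0.2, 0.2, -0.1],
}
t_ordinary = {g: ordinary_t(v) for g, v in m_by_gene.items()}
t_penalized, a = penalized_t(m_by_gene)
```

In this toy example the ordinary t ranks the tiny-variance gene above the large-effect gene, while the penalized version reverses that order because a dominates the denominator when s is very small.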
Other Statistics (cont.)
2. Moderated t-statistics (G Smyth 2004, limma)
• The gene-wise standard deviation in the t-statistic is replaced by a shrinkage estimate of the standard deviation, which combines the pooled sd from all genes with the sd for gene i.
• Estimation is done using an extension to the empirical Bayes method in Lönnstedt & Speed (2002).
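The formula image is not preserved in this transcript. For orientation, the shrinkage estimate and moderated t as given in Smyth (2004) are written out below; the notation ($d_0$, $s_0^2$ for the prior degrees of freedom and prior variance estimated from all genes, $d_i$, $s_i^2$ for gene $i$) follows that paper and may differ from the symbols on the original slide.

```latex
% Shrinkage (posterior) variance for gene i: a weighted average of the
% prior variance s_0^2 and the gene-wise sample variance s_i^2.
\tilde{s}_i^2 = \frac{d_0\, s_0^2 + d_i\, s_i^2}{d_0 + d_i}

% Moderated t-statistic: \hat{\beta}_i is the estimated log fold change and
% v_i the unscaled variance factor (1/n_1 + 1/n_2 for a two-group comparison).
\tilde{t}_i = \frac{\hat{\beta}_i}{\tilde{s}_i \sqrt{v_i}}
```

Under the null hypothesis $\tilde{t}_i$ follows a t-distribution with $d_0 + d_i$ degrees of freedom, so borrowing information across genes also increases the degrees of freedom relative to the ordinary t.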
Other Statistics (cont.)
3. B-statistic: log posterior odds ratio
• $B_i = \log \dfrac{\Pr(\text{gene } i \text{ is DE})}{\Pr(\text{gene } i \text{ is not DE})}$
• Equivalent to moderated-t in terms of ranking genes.
4. Single-channel methods modeling absolute gene-expression levels:
  - Newton et al 2001: log-intensities ~ Gamma
  - Wolfinger et al 2001: linear mixed model on log-intensities
5. Composite methods: Differentially Expressed genes via Distance Synthesis (Yang et al 2004)
• Chooses genes that are extreme on all measures by defining a “distance” statistic based on the measures of choice.
DE by Fold Changes, (limma) Moderated-t, B (lods)
Assessing Significance I: Diagnostic Plots
Assessing Significance II: Testing
• Univariate hypothesis testing: for a single gene, test the null hypothesis H0: the gene is NOT differentially expressed.
• The p-value can be generated via theory or permutation tests.
• Is this p-value correct?
  - Yes, if only looking at ONE gene.
  - With 10,000 non-DE genes, we expect 10,000 × 0.01 = 100 genes with p-value < 0.01, so we clearly can't just use standard p-value thresholds (0.05, 0.01).
• Need to adjust p-values for meaningful interpretation.
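The arithmetic above can be checked with a small simulation. It relies on the fact that a valid p-value is uniformly distributed on [0, 1] under the null hypothesis; the seed and variable names are arbitrary choices for this sketch.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n_genes = 10_000
threshold = 0.01

# Simulate p-values for genes that are all truly non-DE.
null_p_values = [random.random() for _ in range(n_genes)]

# Every gene called significant here is a false positive.
false_positives = sum(p < threshold for p in null_p_values)
expected = n_genes * threshold
```

The observed count fluctuates around the expected 100 from run to run, but it never gets close to zero, which is why unadjusted thresholds are not usable at this scale.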
(Unadjusted) p-values of moderated-t
Multiple Hypothesis Testing
[Table: outcomes of the tests, cross-classified by whether H0 or Ha is true]
Type I Error Rates (False Positives)
Here V is the number of false positives (true null hypotheses rejected) and R is the total number of rejected hypotheses.
• Family-Wise Error Rate (FWER): $\Pr(V > 0) = \Pr(\text{at least one false positive})$
• False Discovery Rate (FDR): the FDR (Benjamini & Hochberg 1995) is the expected proportion of type I errors among the rejected hypotheses,
  $\mathrm{FDR} = E(Q)$, with $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$.
Multiple Testing: Controlling a Type I Error Rate
AIM: for a given type I error rate $\alpha$, use a procedure to select a set of “significant” genes that guarantees a type I error rate $\le \alpha$.
Adjusted p-values: Controlling the FWER
• Bonferroni correction: $\tilde{p}_g = \min(m\,p_g, 1)$; most conservative adjustment.
• Šidák: $\tilde{p}_g = 1 - (1 - p_g)^m$; assumes independence among genes.
• minP (Westfall & Young): estimated through permutation; allows dependency between genes.
• maxT: replace $p_g$ by test statistics $T_g$, min by max. Less computationally intensive than minP.
• Step-down and step-up variants.
• Choosing all genes with adjusted p-value $\le \alpha$ controls the FWER at level $\alpha$.
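The single-step adjustments are one-liners, and a step-down variant is only slightly longer. In the sketch below, Holm's procedure is used as the step-down example; that choice, the function names and the p-values are assumptions made for illustration (the slide does not name a specific step-down method).

```python
def bonferroni(p_values):
    """Single-step Bonferroni adjusted p-values: min(m * p, 1)."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

def sidak(p_values):
    """Single-step Sidak adjusted p-values: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values (a step-down version of Bonferroni)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1);
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four genes.
p = [0.001, 0.01, 0.04, 0.5]
p_bonferroni = bonferroni(p)
p_sidak = sidak(p)
p_holm = holm(p)
```

For these values the Holm adjustment is never larger than Bonferroni, and Šidák is slightly smaller than Bonferroni for every gene.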
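Controlling the FDR (Benjamini/Hochberg)
• Order unadjusted p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$
• To control FDR = E(V/R) at level $\alpha$, reject the hypotheses $H_{(1)}, \dots, H_{(j^*)}$, where $j^* = \max\{j : p_{(j)} \le (j/m)\,\alpha\}$
• Adjusted p-values: $\tilde{p}_{(i)} = \min_{k \ge i} \min\{(m/k)\,p_{(k)},\, 1\}$
• Interpretation: expect about 5% false positives among genes with FDR-adjusted p-values < 0.05.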
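The Benjamini-Hochberg step-up adjustment is short enough to write out in full. This is a plain-Python sketch with a hypothetical function name and p-values, not the code used for the analysis in these slides.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (step-up) FDR-adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so each adjusted
    # value is the minimum of (m / k) * p_(k) over all ranks k >= its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values for four genes.
p = [0.001, 0.01, 0.04, 0.5]
p_bh = benjamini_hochberg(p)
# Genes called significant when controlling the FDR at 5%.
significant = [i for i, q in enumerate(p_bh) if q < 0.05]
```

Here the first two genes pass at a 5% FDR, while the third (unadjusted p = 0.04) does not, since its adjusted value is about 0.053.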
Adjusted p-values
[Plot of adjusted p-values, with a reference line at p = 0.01]