Differential Expression Analysis, Multiple Hypotheses Testing • Xiaole Shirley Liu • STAT115 / STAT215
Variance Stabilization in Differential Expression Analysis • Problem with estimating variance when the sample size is small (e.g. 3 treatments + 3 controls) • Use a constant for all the genes? • Significance Analysis of Microarrays (SAM) • Modified t*: increase σ based on the σ of other genes on the array (i.e. lowest 5th percentile of σ) • LIMMA: Smyth 2004
LIMMA: Design Matrix • Specifies RNA samples used on arrays
> Mat
            Treat1  Treat2  Control
  Sample1   1       0       0
  Sample2   1       0       0
  Sample3   1       0       0
  Sample4   0       1       0
  Sample5   0       1       0
  Sample6   0       1       0
  Sample7   0       0       1
  Sample8   0       0       1
  Sample9   0       0       1
LIMMA: Contrast Matrix • Specifies which comparisons are of interest
> contrast
            Treat1-Control  Treat2-Control
  Treat1    1               0
  Treat2    0               1
  Control   -1              -1
• Smooth gene-wise variance towards a common (typical) value by borrowing information from all the genes, but allow flexibility for individual genes
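The design and contrast matrices above can be sketched in plain Python. This is only an illustration, not LIMMA itself: for an indicator design like this one, the least-squares coefficients reduce to per-group means, and each contrast column is a linear combination of those means. The expression values in `y` are hypothetical.

```python
# Design matrix from the slide: rows are Sample1..Sample9,
# columns are Treat1, Treat2, Control.
design = [
    [1, 0, 0], [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 1],
]

# Contrast matrix from the slide: rows are Treat1, Treat2, Control,
# columns are Treat1-Control and Treat2-Control.
contrast = [
    [1, 0],
    [0, 1],
    [-1, -1],
]


def group_means(design, y):
    """Least-squares coefficients for an indicator design: one mean per group."""
    means = []
    for j in range(len(design[0])):
        values = [y[i] for i, row in enumerate(design) if row[j] == 1]
        means.append(sum(values) / len(values))
    return means


def apply_contrasts(coefs, contrast):
    """Combine the coefficients column by column according to the contrast matrix."""
    return [
        sum(coefs[j] * contrast[j][k] for j in range(len(coefs)))
        for k in range(len(contrast[0]))
    ]


# Hypothetical log-expression values of one gene across the nine samples.
y = [8.0, 9.0, 10.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0]
coefs = group_means(design, y)             # means of Treat1, Treat2, Control
effects = apply_contrasts(coefs, contrast)  # Treat1-Control, Treat2-Control
print(coefs, effects)
```

In practice the fit is done per gene across the whole array; the sketch only shows how the two matrices relate samples to comparisons.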
LIMMA Hierarchical Model • The prior variance s0² in effect adds d0 extra arrays (prior degrees of freedom) for estimating the variance of gene g
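LIMMA Moderated T-test • Ordinary t-test: t_gj = β̂_gj / (s_g √v_j) • Moderated t-test: t̃_gj = β̂_gj / (s̃_g √v_j), where s̃_g is the moderated standard deviation and the DoF increase from d_g to d0 + d_g • v_j is based on the number of samples in the particular comparison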
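A minimal sketch of the variance shrinkage behind the moderated t-test, assuming the standard form from Smyth (2004): the moderated variance is the degrees-of-freedom-weighted average of the prior variance s0² and the gene's own variance. LIMMA estimates d0 and s0² from all genes by empirical Bayes; here they are simply supplied as hypothetical inputs, and the function names are illustrative.

```python
import math


def moderated_variance(s2_g, d_g, s2_0, d_0):
    """Shrink the gene-wise variance s2_g (d_g DoF) towards the prior s2_0 (d_0 DoF)."""
    return (d_0 * s2_0 + d_g * s2_g) / (d_0 + d_g)


def t_statistic(effect, variance, v):
    """t = effect / sqrt(variance * v), where v depends on the group sizes."""
    return effect / math.sqrt(variance * v)


# Hypothetical gene: 3 treatments vs 3 controls, so d_g = 4 and v = 1/3 + 1/3.
effect = 0.5            # estimated log fold change
s2_g = 0.01             # unusually small sample variance for this gene
d_g = 4
v = 1 / 3 + 1 / 3

# Hypothetical prior values (LIMMA would estimate these from all the genes).
s2_0, d_0 = 0.05, 4

s2_mod = moderated_variance(s2_g, d_g, s2_0, d_0)
t_ordinary = t_statistic(effect, s2_g, v)     # compared against t with d_g DoF
t_moderated = t_statistic(effect, s2_mod, v)  # compared against t with d_0 + d_g DoF
print(s2_mod, t_ordinary, t_moderated)
```

A gene with an accidentally tiny variance gets a less extreme statistic after shrinkage, which is the point of borrowing information across genes.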
Multiple Hypotheses Testing • We test differential expression for every gene with a p-value cutoff, e.g. 0.01 • For ~20K genes on the array, potentially 0.01 x 20K = 200 genes wrongly called • H0: no diff expr; H1: diff expr • Reject H0: call something differentially expressed • Should control family-wise error rate or false discovery rate
Family-Wise Error Rate • P(at least one false rejection) < α, i.e. P(no false rejection) > 1 - α • Bonferroni correction: to control the family-wise error rate for testing m hypotheses at level α, we need to control the false rejection rate for each individual test at α/m • If α is 0.05, for 20K genes, the p-value cutoff is 0.05/20K = 2.5E-6 • Too conservative for differentially expressed gene selection
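The Bonferroni arithmetic can be checked with a short sketch. The function names and the three example p-values are illustrative assumptions.

```python
def bonferroni_cutoff(alpha, m):
    """Per-test p-value cutoff that controls the family-wise error rate at alpha."""
    return alpha / m


def bonferroni_reject(p_values, alpha=0.05):
    """Flag each hypothesis whose p-value is at or below alpha / m."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Slide example: alpha = 0.05 over 20K genes.
cutoff_20k = bonferroni_cutoff(0.05, 20000)
print(cutoff_20k)

# Hypothetical p-values for three tests (cutoff = 0.05 / 3).
calls = bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05)
print(calls)
```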
False Discovery Rate • V: type I errors, false positives • T: type II errors, false negatives • R: all genes called (rejections) • FDR = V / R = FP / all called
False Discovery Rate • Less conservative than family-wise error rate • Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α* • Assume all the p-values from different tests are independent • Draw all m genes (x), ranked by p-value (y) • Draw line y = x α* / m, x = 1…m • Call all the genes up to the largest x whose p-value falls below the line
FDR Threshold • [Figure: p-values of genes ranked by p-value, plotted against index / m, with the x α* / m line]
Q-value • Storey & Tibshirani, PNAS, 2003 • Empirically derived q-value • Every p-value has its corresponding q-value (FDR) • FDR’s academic vs practical values
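A simplified sketch of how each p-value can be given a corresponding q-value. It assumes the proportion of true nulls, `pi0`, is supplied by the caller. With `pi0 = 1.0` this reduces to the Benjamini-Hochberg adjusted p-value, which is a conservative simplification; Storey and Tibshirani estimate `pi0` from the p-value distribution, and that step is not reproduced here.

```python
def q_values(p_values, pi0=1.0):
    """q-value per test: smallest estimated FDR at which that test would be called."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q


# Hypothetical p-values.
q = q_values([0.01, 0.02, 0.03, 0.5])
print(q)
```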
Gene Annotation • How to report differentially expressed genes or gene clusters? • Enriched for certain pathways, certain functions, or proteins localized in the same complex, etc.? • Gene Ontology Consortium • Ashburner et al 1998 • Annotate gene function in the human genome • Now extended to many model organisms • Why do we care? • Effectively communicate biomedical knowledge • Organize and summarize annotations in a structured way • Allow effective and meaningful computation on gene annotations
GO Categories • Molecular function • Describe gene’s jobs or abilities • E.g. transporters, transcription factor • Biological process • Events or pathways • E.g. cell differentiation, maturation, development • Cellular component • Describe locations (subcellular structures, macromolecular complexes) • E.g. nucleus, cell membrane, protein complexes
GO • Relationships: • Subclass: Is_a • Membership: Part_of • Topological: adjacent_to; Derivation: derives_from • E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript • Same term could be annotated at multiple branches • Directed acyclic graph
Evaluate Differentially Expressed Genes • NetAffx mapped GO terms for all probesets
              Whole genome   Up genes
  GO term X   100            80
  Total       20K            200
• Statistical significance? • Binomial proportion test • p = 100 / 20K = 0.005 • Check z table
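The binomial proportion test on these counts can be sketched with the normal approximation. Function names are illustrative. With an expected count of only 200 x 0.005 = 1, the approximation is rough, so treat the z-score as an order-of-magnitude indication rather than an exact result.

```python
import math


def binomial_z(k, n, p0):
    """z-score of observing k successes in n trials when the null proportion is p0."""
    expected = n * p0
    sd = math.sqrt(n * p0 * (1 - p0))
    return (k - expected) / sd


def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Slide numbers: 100 of 20K genes carry GO term X; 80 of the 200 up genes do.
p0 = 100 / 20000
z = binomial_z(80, 200, p0)
print(z, upper_tail_p(z))
```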
Evaluate Differentially Expressed Genes
              Whole genome   Up genes
  GO term X   100            80
  Total       20K            200
• Chi sq test or Fisher's exact test (expected counts in parentheses):
           Up          !Up               Total
  GO:      80 (1)      20 (99)           100
  !GO:     120 (199)   20K-220 (19701)   20K-100
  Total:   200         20K-200           20K
• Check Chi-sq table
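Both tests on this 2x2 table can be sketched in stdlib Python (`math.comb` needs Python 3.8 or later). The function names are illustrative. The !GO / !Up cell is computed as 20,000 - 100 - 120 = 19,780. The Fisher p-value here is the one-sided hypergeometric upper tail, which is an assumption about which alternative is of interest.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def fisher_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    k_max = min(K, n)
    numerator = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, k_max + 1)
    )
    return numerator / math.comb(N, n)


# Observed counts: rows GO / !GO, columns Up / !Up.
observed = [[80, 20], [120, 19780]]
chi2 = chi_square_2x2(observed)
p_fisher = fisher_upper_tail(80, 100, 200, 20000)
print(chi2, p_fisher)
```

The chi-square statistic has 1 degree of freedom for a 2x2 table. When expected counts are this small (1 in the GO / Up cell), Fisher's exact test is usually the safer choice.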
GO Tools for Microarray Analysis • http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools • Hundreds • DAVID
Gene Set Enrichment Analysis • In some microarray experiments comparing two conditions, there might be no single gene significantly diff expressed, but a group of genes slightly diff expressed • Check a set of genes with similar annotation (e.g. GO) and see their expression values • Kolmogorov-Smirnov test • GSEA at Broad Institute
Gene Set Enrichment Analysis • Mootha et al, PNAS 2003 • Kolmogorov-Smirnov test • Cumulative fraction function: What fraction of genes are below this fold change?
Gene Set Enrichment Analysis • Alternative to KS: one-sample z-test • Population of all the genes follows a normal distribution ~ N(μ, σ²) • Avg of the n genes (X̄) with a specific annotation: z = (X̄ - μ) / (σ / √n) (STAT115, 03/18/2008)
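The one-sample z-test can be sketched as below, assuming μ and σ are taken as the mean and population standard deviation of the scores of all genes. The function name and the scores are hypothetical.

```python
import math
from statistics import mean, pstdev


def gene_set_z(all_scores, set_scores):
    """z = (mean of the set - population mean) / (population sd / sqrt(set size))."""
    mu = mean(all_scores)
    sigma = pstdev(all_scores)
    n = len(set_scores)
    return (mean(set_scores) - mu) / (sigma / math.sqrt(n))


# Hypothetical scores (e.g. log fold changes) for all genes and for one annotated set.
all_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
set_scores = [1.0, 2.0]
z_set = gene_set_z(all_scores, set_scores)
print(z_set)
```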
Gene Set Enrichment Analysis • Set of genes with specific annotation involved in coordinated down-regulation • Need to define the set before looking at the data • Can only see the significance by looking at the whole set
Expanded Gene Sets • Subramanian, et al PNAS 2005
Summary • LIMMA: use hierarchical model to stabilize gene-wise variance • FDR: adjust for multiple hypotheses testing • FWER, Benjamini-Hochberg, qvalue • GO Annotation, directed and acyclic • 3 categories, and simple relationships • Test for statistical enrichment • GSEA: use existing GO categories and other profile gene sets, KS tests