Integrating Biology and Statistics: Gene Set Methods BIOS 691-003 Winter/Spring 2010
Philosophical Overture • Integrating biology and statistics • Gene sets: genes whose protein products collaborate on a well-defined function • Vague! • Hard to define ‘function’ or draw boundary on ‘gene sets’ • Statistical methods often ad-hoc • Be skeptical... but optimistic
Historical Motivations • Too many genes are significant • Researchers used to generate a list by p-value and comb for genes that work together • First pathway tools automated this process • Patterns may be more significant than any individual gene • e.g. if most genes in glycogen biosynthesis are up, but none is significant individually (after multiple-comparisons adjustment) • We can infer that glycogen is being made
Goals of Current Practice • Characterize biological meaning of joint changes in gene expression • Organize expression (or other) changes into meaningful ‘chunks’ (themes) • Identify crucial points in process where intervention could make a difference
Gene Sets • Gene Ontology • Biological Process • Molecular Function • Cellular Component (location) • Pathway Databases • KEGG • BioCarta • MSigDB (Broad Institute)
Approaches • Univariate (most of current practice): • Discrete methods based on counting • Continuous methods: summarize gene test statistics by set • Multivariate (promising but unclear): • Compare differences to normal covariation of genes in groups across individuals • Use known biological relationships to construct test statistics
Univariate Approaches • Discrete tests: enrichment for groups in gene lists • Select genes differentially expressed at some cutoff • For each gene group cross-tabulate • Test for significance (Hypergeometric or Fisher test) • Continuous tests: from gene scores to group scores • Compare distribution of scores within each group to random selections • GSEA (Gene Set Enrichment Analysis) • PAGE (Parametric Analysis of Gene Expression)
Discrete Approach – 2 x 2 Table • For each set in turn, construct a 2 x 2 table of significance vs. membership in the set • [Figure: 2 x 2 table with the formula for P]
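The cross-tabulation described on this slide can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function names `enrichment_table` and `hypergeom_upper_tail`, the toy gene identifiers, and the choice of a one-sided (over-representation) tail are all assumptions made for the example.

```python
from math import comb

def enrichment_table(all_genes, significant, gene_set):
    """Cross-tabulate significance against set membership (hypothetical helper)."""
    universe = set(all_genes)
    sig = set(significant) & universe
    members = set(gene_set) & universe
    a = len(sig & members)          # significant and in the set
    b = len(sig - members)          # significant, not in the set
    c = len(members - sig)          # in the set, not significant
    d = len(universe) - a - b - c   # neither
    return a, b, c, d

def hypergeom_upper_tail(a, b, c, d):
    """P(overlap >= a) with all margins fixed: a one-sided Fisher-type test."""
    n = a + b + c + d   # genes on the chip
    k = a + c           # genes in the set
    s = a + b           # significant genes
    upper = min(k, s)
    favourable = sum(comb(k, x) * comb(n - k, s - x) for x in range(a, upper + 1))
    return favourable / comb(n, s)

# Toy data (assumed): 20 genes, 5 significant, a set of 4 with 3 significant.
genes = [f"g{i}" for i in range(20)]
significant_genes = genes[:5]
example_set = ["g0", "g1", "g2", "g10"]
table = enrichment_table(genes, significant_genes, example_set)
p_value = hypergeom_upper_tail(*table)
```

Restricting everything to the genes actually on the chip matters: the universe defines the margins of the table.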
Significance Testing of Categories • Fisher’s Exact Test • Condition on fixed margins • Of all tables with the same margins, how many show dependence as or more extreme? • Hard to compute when either n or k is large • Approximations • Binomial (when k/n is small) • Chi-square (when expected values > 5) • G² (log-likelihood ratio; compare to χ² on 1 df)
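The chi-square and G² approximations can be computed directly from the four cell counts. The sketch below is illustrative only; the function names and the example tables are assumptions. The 1-df tail probability uses the identity P(χ² > x) = erfc(√(x/2)).

```python
from math import log, sqrt, erfc

def expected_counts(a, b, c, d):
    """Expected cell counts under independence, in the order (a, b, c, d)."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [rows[i] * cols[j] / n for i in range(2) for j in range(2)]

def chi_square_stat(a, b, c, d):
    """Pearson chi-square statistic for a 2 x 2 table."""
    exp = expected_counts(a, b, c, d)
    return sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), exp))

def g2_stat(a, b, c, d):
    """Log-likelihood ratio statistic; empty cells contribute zero."""
    exp = expected_counts(a, b, c, d)
    return 2.0 * sum(o * log(o / e) for o, e in zip((a, b, c, d), exp) if o > 0)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable on 1 df."""
    return erfc(sqrt(x / 2.0))

# Assumed example: 30 of 50 set genes significant vs 70 of 250 others.
chi2 = chi_square_stat(30, 70, 20, 180)
g2 = g2_stat(30, 70, 20, 180)
p_chi2 = chi2_sf_1df(chi2)
p_g2 = chi2_sf_1df(g2)
```

Both statistics are compared to the same χ² reference distribution; they tend to agree when all expected counts are comfortably large.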
Practical Issues – I • What is the appropriate null distribution? • Gene sets are highly correlated because many overlap • Must do permutation analysis • How to permute? • Random sets of genes? Or • Random assignments of samples? • P-value or FDR? • Heuristic method • More constrained by annotation than statistics
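One way to picture the "random assignments of samples" option is the sketch below, which shuffles the group labels and recomputes a set-level score each time. The score used here (mean difference in group means across the set's genes) and all names and data are assumptions chosen for brevity; they are not taken from the slides.

```python
import random

def set_score(expr, labels, gene_set):
    """Average over set genes of (mean in group 1 - mean in group 0)."""
    idx1 = [j for j, g in enumerate(labels) if g == 1]
    idx0 = [j for j, g in enumerate(labels) if g == 0]
    diffs = []
    for gene in gene_set:
        row = expr[gene]
        mean1 = sum(row[j] for j in idx1) / len(idx1)
        mean0 = sum(row[j] for j in idx0) / len(idx0)
        diffs.append(mean1 - mean0)
    return sum(diffs) / len(diffs)

def sample_permutation_pvalue(expr, labels, gene_set, n_perm=1000, seed=0):
    """Two-sided p-value from shuffling sample labels (gene correlations kept)."""
    rng = random.Random(seed)
    observed = abs(set_score(expr, labels, gene_set))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(set_score(expr, shuffled, gene_set)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Assumed toy data: three genes, four samples per group.
expr = {
    "a": [0.1, 0.2, 0.0, 0.3, 2.1, 2.0, 2.2, 1.9],
    "b": [0.0, 0.1, 0.2, 0.1, 1.8, 2.1, 2.0, 2.2],
    "c": [0.2, 0.0, 0.1, 0.3, 2.0, 1.9, 2.3, 2.1],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = sample_permutation_pvalue(expr, labels, ["a", "b", "c"])
```

Because whole samples are relabelled, the correlation among genes in the set is preserved in every permuted data set, which is the property that permuting genes does not have.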
Practical Issues – II • If a child category is declared significant, how to assess significance of parent category? • Include child category • Consider only genes external to child • In practice big categories are not useful • Small categories may not be well represented on chip • Select categories in middle range: 5-20 represented on chip
Critiques of Discrete Approach • No use of information about size of change • Large t scores count like small t’s • Continuous procedures have more power than discrete procedures on discretized continuous data
GSEA (Gene Set Enrichment Analysis) • Introduced in 2003 by Mootha et al. to address a puzzle in a diabetes data set • No genes significant individually • But Oxidative Phosphorylation mostly up • GSEA tests ranks of genes in a gene set against randomly distributed ranks • Kolmogorov-Smirnov test: • Maximum difference between the distribution of ranks of genes in the set and the uniform distribution
Kolmogorov-Smirnov Test • Based on statistics of the ‘Brownian Bridge’ (a random walk with fixed end) • Maximum difference is the test statistic • Null distribution known • Reformulated by GSEA as difference of CDF – uniform from axis
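A running-sum form of the K-S statistic for a gene set can be sketched as follows. This is an unweighted illustration under stated assumptions: the function name `ks_enrichment` and the toy ranking are invented here, and the statistic is written as the hit CDF minus the miss CDF, which is one common way to express the comparison rather than necessarily the exact form on the slide.

```python
def ks_enrichment(ranked_genes, gene_set):
    """Signed maximum deviation of the running sum along the ranked list.

    The sum steps up by 1/n_hit at each set member and down by 1/n_miss
    otherwise, so it starts and ends at zero like a bridge with fixed ends.
    """
    members = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Assumed toy ranking: g0 is the most up-regulated gene, g9 the most down.
ranking = [f"g{i}" for i in range(10)]
es_top = ks_enrichment(ranking, ["g0", "g1", "g2"])     # set at the top
es_bottom = ks_enrichment(ranking, ["g7", "g8", "g9"])  # set at the bottom
```

The sign of the result indicates whether the set is concentrated toward the top or the bottom of the ranking.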
K-S Test Finds Irrelevant Sets • Sometimes ranks are concentrated in the middle • K-S statistic is high, but not meaningful for pathway change • Fix: ad hoc weighting by actual t-scores emphasizes departures at the extreme ends • No theory • Generate null distribution by permutation
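The weighting fix can be illustrated by letting each set member step the running sum in proportion to the size of its score. The sketch below is a simplified illustration, not the published GSEA code; the name `weighted_enrichment`, the `power` parameter, and the toy scores are assumptions. Setting `power=0` recovers the unweighted statistic.

```python
def weighted_enrichment(scores, gene_set, power=1.0):
    """Running-sum statistic with hits weighted by |score| ** power."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    members = set(gene_set) & set(ranked)
    n_miss = len(ranked) - len(members)
    total_weight = sum(abs(scores[g]) ** power for g in members)
    if not members or n_miss == 0 or total_weight == 0:
        raise ValueError("set must be a proper, non-degenerate subset of the genes")
    running = 0.0
    best = 0.0
    for gene in ranked:
        if gene in members:
            running += abs(scores[gene]) ** power / total_weight
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Assumed t-like scores; the set has one extreme gene and one near zero.
t_scores = {"g0": 3.0, "g1": 2.0, "g2": 1.0, "g3": 0.5,
            "g4": -0.5, "g5": -1.0, "g6": -2.0, "g7": -3.0}
es_weighted = weighted_enrichment(t_scores, ["g0", "g3"], power=1.0)
es_unweighted = weighted_enrichment(t_scores, ["g0", "g3"], power=0.0)
```

With weighting, the gene at the extreme end dominates the statistic and the gene near zero contributes little. Since the weighted statistic has no known null distribution, significance still has to come from permutation.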
Group Z- or T- Scores • PAGE: log fold-changes over all genes follow ‘close to’ Normal distribution • Can estimate σ from overall distribution • T-Profiler: under the null hypothesis, each gene’s t-score follows a t distribution ‘near’ the N(0,1) distribution • Hence the sum over genes in a specific set G is approximately Normal • [Figure: formulas for the PAGE and T-Profiler set scores] • If most genes in a pathway are up-regulated then gene set scores will be significantly high
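The reasoning on this slide (independent, roughly Normal gene scores sum to a Normal set score) can be sketched numerically. The two functions below are illustrations under that IID assumption: `page_z` standardizes the set mean of fold changes using the overall mean and standard deviation, and `summed_t_z` scales the sum of t-scores by the square root of the set size. The names and toy values are assumptions, and the published tools may differ in detail.

```python
from math import sqrt, erfc
from statistics import mean, stdev

def page_z(fold_changes, gene_set):
    """Z-score of the set mean against the distribution of all fold changes."""
    values = list(fold_changes.values())
    mu, sigma = mean(values), stdev(values)
    members = [fold_changes[g] for g in gene_set if g in fold_changes]
    return (mean(members) - mu) * sqrt(len(members)) / sigma

def summed_t_z(t_scores, gene_set):
    """Sum of t-scores over the set, scaled to unit variance under the null."""
    members = [t_scores[g] for g in gene_set if g in t_scores]
    return sum(members) / sqrt(len(members))

def two_sided_normal_p(z):
    """Two-sided p-value for a standard Normal score."""
    return erfc(abs(z) / sqrt(2.0))

# Assumed toy values for five genes.
log_fc = {"g0": -2.0, "g1": -1.0, "g2": 0.0, "g3": 1.0, "g4": 2.0}
z_page = page_z(log_fc, ["g3", "g4"])
z_t = summed_t_z({"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}, ["a", "b", "c", "d"])
p_t = two_sided_normal_p(z_t)
```

Four genes each with a modest t-score of 1 give a set score of 2, which is how a set can be significant when none of its genes is.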
Issues and Critiques • Same issues as discrete approach • Null distribution by permuting samples • GSEA finally gets that right in 2005 • Null distribution for Z-test assumes IID • Methods assume all meaningful changes in same direction • Don’t use information about normal co-variation
Why Is Covariation Important? • Most cellular processes are homeostatic: • They find a good functional set-point • Coping with variation in inputs … • … AND in specific regulatory couplings • Most of us have regulatory SNPs that vary expression by a factor of two or more • Other genes are expressed at somewhat different levels to accommodate key processes
Multivariate Approaches • Classical multivariate methods • Multi-dimensional Scaling • Hotelling’s T² • Machine learning approaches • Topological score relative to network • Prediction by machine learning tool • e.g. ‘random forest’
PCA • PC1 lies along the direction of maximal variation; PC2 lies at right angles to it, along the direction of next-highest variation • [Figure: three correlated variables]
Multi-Dimensional Scaling • Aim: to represent graphically the most information about relationships among samples with multi-dimensional attributes in 2 (or 3) dimensions • Algorithm: • Transform distances into cross-product matrix • Initial PCA onto 2 (or 3) axes • Deform until better representation • Minimize ‘strain’ measure: [Figure: formula for strain]
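The first algorithm step, turning distances into a cross-product matrix, is usually done by double-centering the squared distances. The sketch below shows only that step, under the assumption that the distances are Euclidean; the function name `double_center` and the three-point example are invented for illustration.

```python
def double_center(dist):
    """Return B = -1/2 * J * D^2 * J, the centred cross-product matrix.

    dist is a symmetric matrix of pairwise distances between samples;
    B[i][j] is then the inner product of the centred sample coordinates.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]

# Assumed example: three samples at positions 0, 1 and 3 on a line.
distances = [[0.0, 1.0, 3.0],
             [1.0, 0.0, 2.0],
             [3.0, 2.0, 0.0]]
cross_products = double_center(distances)
```

For these points the centred coordinates are -4/3, -1/3 and 5/3, and the entries of `cross_products` are their pairwise products. The leading eigenvectors of this matrix give the initial low-dimensional configuration.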
Separating Using MDS • Left: distributions of individual variables • Right: MDS plot (in this case PCA)
MDS for Pathways • BAD pathway: controlled cell death • [Figure legend: Normal, IBC, Other BC] • Clear separation between groups • Cancer samples don’t have coherent variation
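Hotelling’s T² • Compute distance between sample means using (common) metric of covariation: $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)$ • Where $\bar{x}_1, \bar{x}_2$ are the group mean vectors over the genes in the set, $n_1, n_2$ are the group sizes, and $S$ is the pooled covariance matrix • Multidimensional analog of t (actually F) statistic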
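A small two-sample computation may make the statistic concrete. The sketch below uses the standard textbook two-sample Hotelling T² with a pooled covariance matrix; the helper names, the plain-Python linear solver, and the toy data are assumptions for illustration. With p genes and n1 + n2 samples, the scaled value is referred to an F distribution on (p, n1 + n2 - p - 1) degrees of freedom.

```python
def mean_vector(rows):
    """Column means of a list of samples (each sample is a list of genes)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, center):
    """Sum of outer products of deviations from the group mean."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - c for x, c in zip(row, center)]
        for i in range(p):
            for j in range(p):
                s[i][j] += dev[i] * dev[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def hotelling_t2(group1, group2):
    """Return (T^2, F) for two groups of samples over the same genes."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    p = len(m1)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    t2 = n1 * n2 / (n1 + n2) * sum(d * w for d, w in zip(diff, solve(pooled, diff)))
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat

# Assumed toy data: two genes, four samples per group, group B shifted by 1.
group_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
group_b = [[2.0, 3.0], [3.0, 2.0], [4.0, 5.0], [5.0, 4.0]]
t2, f_stat = hotelling_t2(group_a, group_b)
```

The pooled matrix needs more samples than genes to be invertible, which is one reason small samples give unreliable estimates of S.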
Principles of Kong et al Method • Normal covariation generally acts to preserve homeostasis • The transcription of genes that participate in many processes will be changed • The joint changes in genes will be most distinctive for those genes active in pathways that are working differently
Issues • Not robust to outliers • In practice this may not matter much (?) • Assumes same covariance in each sample • Small samples -> unreliable S estimates • Loss of power • Robust / Regularized Methods improve sensitivity by up to a factor of 10! • Yates & Reimers (in prep)
Overall Assessment • Gene sets are somewhat arbitrary • Most ‘modules’ overlap extensively with others • Many ‘modules’ act by protein modification rather than gene expression • Current methods represent a first attempt to bring biological information to bear on the significance problem