Gene Set Analysis. 09/24/07. From individual gene to gene sets.
Gene Set Analysis 09/24/07
From individual gene to gene sets • Finding a list of differentially expressed genes is only the starting point. Suppose we have identified 500 genes that are differentially expressed, then what do we do about it? Can we learn something about the underlying biological pathway?
Sometimes one cannot find a single gene that is differentially expressed, as the statistical criteria are too stringent and/or the data is too noisy. Can we still learn something useful from the microarray experiment?
Gene set • A gene set contains genes that are functionally related. The gene set assignment is independent of the microarray data at hand. We want to know whether a gene set is differentially expressed. • Functional annotation is usually obtained from the following sources. • Kyoto Encyclopedia of Genes and Genomes (KEGG): • Gene Ontology (GO):
KEGG • KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for: • 1. Metabolism • 2. Genetic Information Processing • 3. Environmental Information Processing • 4. Cellular Processes • 5. Human Diseases • and also on the structure relationships (KEGG drug structure maps) in: • 6. Drug Development • Website: http://www.genome.jp/kegg/
GO terms • Ontologies are 'specifications of a relational vocabulary'. • GO contains three structured vocabularies: cellular component, biological process and molecular function. • GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context. • Website: http://www.geneontology.org/
Over-representation analysis
            Differentially expressed    Not differentially expressed    Total
In S        O1 = a                      O2 = b                          a + b
In Sc       O3 = c                      O4 = d                          c + d
Total       a + c                       b + d                           n
• Null hypothesis: The genes in S are at most as often differentially expressed as the genes in Sc. Compare a/(a + b) with (a + c)/n.
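Statistical significance • Chi-square test: $\chi^2 = \sum_{i=1}^{4} (O_i - E_i)^2 / E_i$, where E_i is the count expected in cell i under the null hypothesis. • Fisher’s exact test, based on the hypergeometric distribution.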
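As a minimal sketch of the exact test, the one-sided p-value can be computed directly from the hypergeometric distribution using the cell counts a, b, c, d of the 2 x 2 table. The function name and the example counts are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_upper_tail(a, b, c, d):
    """One-sided Fisher p-value: P(X >= a) under the hypergeometric null.

    a, b: differentially / not differentially expressed genes in S
    c, d: the same two counts for genes outside S
    """
    n = a + b + c + d        # all genes on the array
    in_set = a + b           # genes in S
    de_total = a + c         # differentially expressed genes overall
    upper = min(in_set, de_total)
    # Count the ways to draw in_set genes with k of them differentially
    # expressed, for every k at least as large as the observed a.
    favourable = sum(
        comb(de_total, k) * comb(n - de_total, in_set - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, in_set)

# Hypothetical counts: 3 of 4 genes in S are differentially expressed,
# against 1 of 4 genes outside S.
p = hypergeom_upper_tail(3, 1, 1, 3)   # 17/70, about 0.243
```

A small p-value indicates that S contains more differentially expressed genes than expected from the overall proportion (a + c)/n.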
Testing multiple GO nodes simultaneously • Determine the significance level for each node. • Then adjust for multiple hypothesis testing: FWER, FDR, etc. (GOSurfer)
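One common way to control the FDR across many nodes is the Benjamini-Hochberg step-up procedure. The slide does not say which procedure is used, so the sketch below is an assumption for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-node p-values.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Nodes whose adjusted value falls below the chosen FDR level are reported as significant.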
Problems with using differentially expressed genes • The result is sensitive to the criterion for calling genes differentially expressed. The analysis is useless if the criterion is too stringent. • Reducing a continuous variable to a binary variable loses useful quantitative information.
ErmineJ • Called FCS in Pavlidis et al. 2004. • The mean of –log(p-value) over all genes in a gene set is used as an aggregate score. • Use a permutation test (permuting genes) to obtain the p-value corresponding to the aggregate score. • Correct for multiple occurrences of a single gene. • Adjust for multiple-hypothesis testing by controlling the FDR.
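The aggregate score and its gene-permutation p-value can be sketched as follows. This is not ErmineJ's actual implementation: the function names, the resampling count and the toy p-values are hypothetical, and the correction for genes that occur more than once is left out.

```python
import math
import random

def gene_set_score(pvalues, gene_set):
    """Mean of -log(p-value) over the genes in the set."""
    return sum(-math.log(pvalues[g]) for g in gene_set) / len(gene_set)

def resampling_pvalue(pvalues, gene_set, n_resamples=2000, seed=0):
    """Compare the observed score with scores of random gene sets
    of the same size drawn from all genes (gene sampling)."""
    rng = random.Random(seed)
    genes = list(pvalues)
    observed = gene_set_score(pvalues, gene_set)
    hits = 0
    for _ in range(n_resamples):
        random_set = rng.sample(genes, len(gene_set))
        if gene_set_score(pvalues, random_set) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_resamples + 1)

# Toy data: 20 genes, five of them with small per-gene p-values.
toy = {f"g{i}": (0.001 if i < 5 else 0.5) for i in range(20)}
p_set = resampling_pvalue(toy, ["g0", "g1", "g2"])
```

Because the null distribution is built by resampling genes, this p-value is subject to the gene-sampling interpretation rather than the subject-sampling one.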
Randomize genes or arrays? Permute genes
Interpretation of p-values • In the gene-sampling setup (e.g., Chi-square test), inference is about a new sample of genes. The expression levels of different genes are assumed to be independent. • In the subject-sampling setup (e.g., permutation test), inference is about a new subject. The label of a subject (treatment or control) is assumed to be independent. The expression levels of different genes may be correlated. • It is more biologically meaningful to use subject-sampling methods.
Gene Set Enrichment Analysis (GSEA) • Consider all genes instead of differentially expressed genes. • Permute class labels • Steps: • 1: Calculation of an enrichment score (ES). • 2: Estimation of significance level of ES. • 3: Adjustment for multiple hypothesis testing. (Mootha 2003)
Basic idea: Rank the genes according to their p-value for being differentially expressed. If there is no correlation between gene expression and membership in A or B, then the rank-distributions for the two sets should also be approximately equal.
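Enrichment Score • Rank the genes by their p-values corresponding to the significance level of differential expression: R1, …, RN. • Define $X_i = -\sqrt{G/(N-G)}$ if Ri is not in S, and $X_i = \sqrt{(N-G)/G}$ if Ri is in S, where G is the number of genes in S. • Then $ES(S) = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i$, that is, the maximum deviation from the expected running sum.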
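A minimal sketch of the running-sum computation, assuming the increments of Mootha et al. 2003: each gene in S adds sqrt((N - G)/G) and each gene outside S subtracts sqrt(G/(N - G)), where G is the number of genes in S. The function name and the toy gene names are hypothetical.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum over the ranked gene list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some but not all genes")
    hit = math.sqrt((n - g) / g)      # step up for a gene in S
    miss = -math.sqrt(g / (n - g))    # step down for a gene outside S
    running = 0.0
    best = -math.inf
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best

# Toy ranking in which both members of the set sit at the top.
es = enrichment_score(["a", "b", "c", "d"], {"a", "b"})   # 2.0
```

With these increments the running sum returns to zero after the last gene, so a large maximum means the members of S are concentrated near the top of the ranking.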
Why • Unbiased • Normalized
Permutation test of the significance of ES • Randomly assign labels to samples, reorder genes, and recompute ES(S). • Estimate the p-value by comparing the observed ES(S) with the ES values computed from randomly shuffled data.
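The label-permutation test can be sketched as below. Several choices are assumptions made to keep the example short: genes are ranked by the difference in group means rather than by p-values, the enrichment score uses the Mootha et al. 2003 increments, and all names and the toy expression values are hypothetical.

```python
import math
import random
from statistics import mean

def enrichment_score(ranked_genes, gene_set):
    """Maximum running sum with Mootha-style increments."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    hit, miss = math.sqrt((n - g) / g), -math.sqrt(g / (n - g))
    running, best = 0.0, -math.inf
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best

def rank_genes(expr, labels):
    """Order genes by mean(label 1) - mean(label 0), largest first."""
    def diff(values):
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        return mean(group1) - mean(group0)
    return sorted(expr, key=lambda gene: diff(expr[gene]), reverse=True)

def permutation_pvalue(expr, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle the sample labels, re-rank the genes, recompute ES(S)."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expr, labels), gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(rank_genes(expr, shuffled), gene_set) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: six samples (three per class) and six genes.
expr = {
    "g1": [0, 0, 0, 3, 3, 3], "g2": [0, 0, 0, 2, 2, 2],
    "g3": [1, 2, 3, 1, 2, 4], "g4": [3, 1, 2, 2, 3, 0],
    "g5": [2, 2, 2, 1, 1, 1], "g6": [5, 4, 6, 3, 2, 1],
}
observed_es, p_value = permutation_pvalue(expr, [0, 0, 0, 1, 1, 1], {"g1", "g2"})
```

Shuffling the labels keeps each sample's expression profile intact, so correlations between genes are preserved in the null distribution.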
Multiple hypothesis testing • Determine ES(S) for each gene set in the collection. • For each S and 1000 fixed permutations p of the array labels, reorder the genes and determine ES(S, p). • Adjust for variation in gene set size. • Compute FDR.
Applications of GSEA • Data: • 22,000 genes • 43 subjects: 17 normal (NGT), 8 partially impaired, 18 diagnosed with disease (DM2) • Gene sets independently curated from literature • No single gene is differentially expressed according to the stringent multiple hypothesis testing criteria.
Results from GSEA • Select the gene set with maximum ES: (OXPHOS) • Genes are consistently down-regulated, although the fold changes are moderate. • Selected gene sets are biologically sensible, consistent with expectation.
Starting point for further analysis • Apply clustering analysis to the selected gene set. • Many genes in the gene set are co-regulated, suggesting they share similar functions.
A self-contained null hypothesis • Null hypothesis: • Competitive version: The genes in G are at most as often differentially expressed as the genes in Gc. • Self-contained version: No genes in G are differentially expressed. • “Self-contained” is more strict than “competitive”.
Drawback for comparing S against SC • This has been compared to a “zero-sum game”. Gene classes are competing with each other. The stronger the evidence in support of differential expression is for one class, the weaker the evidence for differential expression is judged to be for a second class.
Not significant? Drawback for comparing S against SC
Hcomp vs Hself • Advantage: • Self-consistent • When a large number of genes are differentially expressed, multiple pathways may be selected. • Drawback: • Too aggressive. A gene class containing very few differentially expressed genes may not be biologically meaningful.
Hybrid methods • Several aspects of different methods can be mixed, e.g. • Modify GSEA by using the self-contained version to evaluate the p-value. • Similar treatment for ErmineJ. (J. J. Goeman and P. Bühlmann 2006)
Multivariate analysis • Let X1 and X2 be the expression levels for the subject groups 1 and 2. • Given a gene set containing q genes. • The self-contained null hypothesis can be rephrased as the multi-dimensional mean expression vectors (within the given gene set) are the same. • Use multivariate hypothesis testing.
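Hotelling’s T2 • $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{X}_1 - \bar{X}_2)^T S^{-1} (\bar{X}_1 - \bar{X}_2)$, where n1 and n2 are the numbers of subjects in the two groups and S is the pooled covariance matrix of the q genes. • Under the null hypothesis, the scaled statistic $\frac{n_1 + n_2 - q - 1}{(n_1 + n_2 - 2)\, q} T^2$ follows the F-distribution with q and n1 + n2 − q − 1 degrees of freedom. • Multiple hypothesis testing is addressed by FDR control.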
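A minimal sketch of the two-sample Hotelling statistic using only the standard library, assuming each group is given as a list of subjects (rows) with q expression values each and that n1 + n2 - 2 is at least q so the pooled covariance matrix can be inverted. All function names and the toy data are hypothetical, and the F-distribution tail probability is not computed here.

```python
def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, center):
    """Sum of outer products of deviations from the center."""
    q = len(center)
    m = [[0.0] * q for _ in range(q)]
    for row in rows:
        dev = [x - c for x, c in zip(row, center)]
        for i in range(q):
            for j in range(q):
                m[i][j] += dev[i] * dev[j]
    return m

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def hotelling_t2(x1, x2):
    """Return (T2, F statistic) for two groups of q-dimensional rows."""
    n1, n2, q = len(x1), len(x2), len(x1[0])
    m1, m2 = mean_vector(x1), mean_vector(x2)
    s1, s2 = scatter(x1, m1), scatter(x2, m2)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(q)]
              for i in range(q)]
    d = [u - v for u, v in zip(m1, m2)]
    w = solve(pooled, d)               # pooled^{-1} d without inverting
    t2 = n1 * n2 / (n1 + n2) * sum(di * wi for di, wi in zip(d, w))
    f_stat = (n1 + n2 - q - 1) / ((n1 + n2 - 2) * q) * t2
    return t2, f_stat

# Toy gene set with q = 2 genes and four subjects per group.
group1 = [[0, 0], [2, 0], [0, 2], [2, 2]]
group2 = [[1, 3], [3, 3], [1, 5], [3, 5]]
t2, f_stat = hotelling_t2(group1, group2)   # 15.0 and 6.25, up to rounding
```

The F statistic would then be referred to an F-distribution with q and n1 + n2 - q - 1 degrees of freedom to obtain the p-value for the gene set.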
Dimension reduction • Diagonalize the variance matrix S and then project onto the principal components. • Dimensions corresponding to very small eigenvalues are ignored.
Results • Figure 1 in Sek Kwon’s paper.