Genesets and Enrichment Lecture 14 BF528 Instructor: Kritika Karri kkarri@bu.edu
A long list of DE genes: what happens next? • Select some genes for validation • Do some follow-up experiments • Publish a huge table with results • Try to learn about the genes from the published literature
Introduction • Single-gene analysis methods have been instrumental in our understanding of cell-biological processes. • However, in disease processes it is usually not a single gene but a set of genes that is involved in the clinical manifestation of the disease. • It is more relevant to study the changes initiated by a set of genes, which can dramatically alter various cell-biological and metabolic pathways. • Commonly used approaches to analyze a gene set: • over-representation analysis • aggregate score calculation.
“Enrichment” and Gene Sets • Enrichment: the “act of making fuller or meaningful” (dictionary.com) • Gene sets are predefined in the literature or in databases: • Groups of genes that share a similar function, pathway, cellular function, etc. • Gene enrichment: combining information across genes to make sense of gene lists. • A gene set is considered enriched if the experimental findings are in accordance with the set of interest.
Gene Set Enrichment • Gene set enrichment is an approach to finding sets of biologically connected genes that are enriched for differential expression. • Gene set enrichment analysis (GSEA): a statistical analysis that calculates the significance of gene set enrichment by comparing the distribution of the gene set's members to a “background distribution”.
Why do enrichment analysis? • Most array, sequencing, and screening experiments produce • A measurement for most or all genes • List(s) of “interesting” genes • Most cellular processes involve sets of genes. • Can we compare these two kinds of gene sets? • Is the overlap different than expected? • Does this tell us something about cellular mechanisms? • There are too many genes to examine in detail. • Are we biased? • How do we know that what we’re seeing is surprising?
Main Types of Enrichment Analysis • List-based: inputs are • A subset of all genes chosen by some relevant method • A list of annotations, each linked to genes • Rank-based: inputs are • A list of all genes ranked by some metric (ratio, fold change, etc.) • A list of annotations, each linked to genes • List-based with relationships: inputs are • A subset of all genes • A list of annotations, each linked to genes, organized in some relationship (e.g., a hierarchy)
Getting your list • Goal: Identify a list of genes (or probes) that appear to be working together in some way. • What identifiers to use? • Most common method: Get a list of differentially expressed genes • P‐value and/or fold change? • Threshold? • Alternatives: • Define a cluster • Sort data and/or apply a model to rank genes • Recommendations: • Try lists of varying length • Try to maximize signal / noise (What produces the smallest p‐values for enrichment?)
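The two kinds of input described above (a thresholded list and a full ranking) can be sketched as follows. This is a minimal illustration, not the procedure of any particular tool: the gene names, the cutoffs, and the signed ranking metric `sign(log2FC) * -log10(padj)` are all assumptions chosen for the example.

```python
import math

def select_de_genes(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return genes passing both an adjusted p-value and an |log2FC| cutoff."""
    return [
        gene
        for gene, (log2fc, padj) in results.items()
        if padj <= padj_cutoff and abs(log2fc) >= lfc_cutoff
    ]

def rank_genes(results):
    """Rank ALL genes by an assumed signed score: sign(log2FC) * -log10(padj)."""
    def score(item):
        log2fc, padj = item[1]
        # Guard against log10(0) for extremely small p-values.
        return math.copysign(-math.log10(max(padj, 1e-300)), log2fc)
    return [gene for gene, _ in sorted(results.items(), key=score, reverse=True)]

# Hypothetical toy results: gene -> (log2 fold change, adjusted p-value)
toy = {
    "GENE_A": (2.5, 0.001),
    "GENE_B": (-1.8, 0.01),
    "GENE_C": (0.3, 0.04),
    "GENE_D": (1.2, 0.20),
    "GENE_E": (-3.0, 0.0001),
}

# "Try lists of varying length": vary the fold-change cutoff.
for cutoff in (0.5, 1.0, 2.0):
    print(cutoff, select_de_genes(toy, lfc_cutoff=cutoff))

# Input for a rank-based method: every gene, most up-regulated first.
print(rank_genes(toy))
```

The thresholded list feeds list-based tools, while the full ranking is the kind of input a rank-based method expects.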
Annotation Sources • Gene Ontology (most popular) • KEGG; REACTOME pathways • Genes sharing a motif or regulated by the same protein/miRNA • Genes found on the same chromosome • Broad’s Molecular Signatures Database (MSigDB) • Any grouping that is biologically sensible Will discuss in detail!
Statistics to test for enrichment • Tests for enrichment: • Fisher’s exact • Hypergeometric • Binomial • Chi-squared • Kolmogorov-Smirnov • Permutation
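As a small worked example of the hypergeometric test from this list, the sketch below computes the probability of seeing at least `k` annotated genes in a list of `n` genes drawn from a background of `N` genes, of which `K` carry the annotation. For a 2x2 table this upper tail is the same quantity as a one-sided Fisher's exact test. The function name and the numbers are illustrative assumptions.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the background
    K: background genes carrying the annotation
    n: genes in the list of interest
    k: genes in the list that carry the annotation
    """
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical numbers: 20 background genes, 5 annotated,
# a list of 4 genes of which 3 are annotated.
p = hypergeom_enrichment_p(N=20, K=5, n=4, k=3)
print(f"p = {p:.4f}")
```

Note how the result depends on `N`: this is why the choice of background matters when a tool lets you define it.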
Statistical Considerations • What is the probability of observing enrichment at least this extreme by chance? • Different tests produce very different ranges of p-values • All look for over-enrichment; some also look for under-enrichment • Recommendation: • Use p-values as a tool to rank results, but don’t take them literally • Most methods correct for multiple testing (e.g., by controlling the FDR), which is necessary because many annotations are tested at once
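One common way to control the FDR across many tested annotations is the Benjamini-Hochberg procedure. The sketch below is a plain-Python version for illustration; individual tools may use a different correction, and the p-values shown are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical enrichment p-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```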
Things to consider when doing an enrichment analysis • Choose a tool that • Includes your species • Includes your gene / probe identifiers • Has up‐to‐date annotation • Lets you define your background (if possible) • Get recommendations from the usual sources. • Try at least a few tools. • Try lists of varying length. • Some recommended tools • DAVID • GSEA • BIOBASE (Whitehead has license) • BiNGO (uses Cytoscape) • GoMiner: http://discover.nci.nih.gov/gominer • GOstat: http://gostat.wehi.edu.au
Structure of GO • A way to capture biological knowledge for individual gene products in a written and computable form • A set of concepts and their relationships to each other, arranged as a hierarchy. • Descendant terms are related to their parent terms by either “is a” or “part of” relationships. • For example, the nucleus is part of a cell, whereas a neuron is a cell.
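The hierarchy can be pictured as a graph in which each term points to its parents. The sketch below uses a hypothetical four-term ontology (not real GO identifiers) to show how the ancestors of a term are collected, and how a gene annotated to a specific term is then also counted for every ancestor term.

```python
# Hypothetical miniature ontology: child term -> list of (relationship, parent)
toy_ontology = {
    "neuron": [("is_a", "cell")],
    "nucleus": [("part_of", "cell")],
    "nuclear membrane": [("part_of", "nucleus")],
    "cell": [],
}

def ancestors(term, ontology):
    """Collect every term reachable by following is_a / part_of links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in ontology.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(annotations, ontology):
    """Map each term to the genes annotated to it or to any of its descendants."""
    term_to_genes = {}
    for gene, terms in annotations.items():
        for term in terms:
            for t in {term} | ancestors(term, ontology):
                term_to_genes.setdefault(t, set()).add(gene)
    return term_to_genes

# Hypothetical direct annotations: gene -> most specific terms.
direct = {"GENE_A": {"nuclear membrane"}, "GENE_B": {"neuron"}}
print(ancestors("nuclear membrane", toy_ontology))
print(propagate(direct, toy_ontology))
```

Because annotations propagate upward, broad terms near the root collect many genes, which is one reason enrichment results often contain related parent and child terms together.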
Need some statistical significance • The majority of tools are based on the idea of identifying GO categories that are significantly enriched in a list of differentially expressed genes. • This requires some threshold to define genes as “significant”. • GSEA takes a different approach by considering all assayed genes.
DAVID • Database for Annotation, Visualization and Integrated Discovery (NIAID) http://david.abcc.ncifcrf.gov/ • List‐based; Lots of identifiers; lots of species • Allows background definition • Statistic is a modified Fisher exact test
Over-representation vs. Aggregate Score • Over-representation relies on the cutoff used to generate the gene list, and the result can vary considerably depending on that list. • It may start from a long list of significant genes without any unifying biological theme. • The cutoff value is often arbitrary! • We are really examining only a handful of genes, ignoring much of the data. • Aggregate-score methods compute a score for each gene set from the gene-specific scores of its members, which overcomes these limitations.
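To make the aggregate-score idea concrete, here is one simple possibility: a z-score that compares the mean gene-level score inside a set with the mean over all genes. It is only a sketch under the assumption that gene-level scores (e.g., t-statistics or log fold changes) are available for every assayed gene; it is not presented as the scoring scheme of any specific tool.

```python
from math import sqrt
from statistics import mean, pstdev

def aggregate_z(gene_scores, gene_set):
    """Z-score of the mean score of a gene set against all assayed genes.

    gene_scores: dict mapping every assayed gene to a gene-level score
    gene_set:    iterable of gene names; genes not assayed are ignored
    """
    members = [gene_scores[g] for g in gene_set if g in gene_scores]
    if not members:
        raise ValueError("none of the gene set members were assayed")
    all_scores = list(gene_scores.values())
    mu = mean(all_scores)
    sigma = pstdev(all_scores)
    if sigma == 0:
        raise ValueError("gene-level scores have zero variance")
    return (mean(members) - mu) * sqrt(len(members)) / sigma

# Hypothetical gene-level scores; no cutoff is applied to any gene.
scores = {"GENE_A": 2.0, "GENE_B": 2.0, "GENE_C": -2.0, "GENE_D": -2.0}
print(aggregate_z(scores, {"GENE_A", "GENE_B"}))
```

Every gene contributes to the score, so no arbitrary significance threshold is needed.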
Gene Set Enrichment Analysis (GSEA) • Detecting modest changes in gene expression datasets is hard, due to: • the large number of variables, • the high variability between samples, and • the limited number of samples. • The goal of GSEA is to detect modest but coordinated changes in prespecified sets of related genes. • Such a set might include all the genes in a specific pathway, for instance.
GSEA Input Files • Gene expression dataset • [or alternatively, a ranked list of genes] • Phenotype labels • Discrete phenotypes – two or more • Continuous phenotypes, e.g. time series • Gene sets • Select an MSigDB gene set collection • Or supply a gene set file http://www.broadinstitute.org/gsea/
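A supplied gene set file is typically in GMT format: one gene set per line, tab-separated, with the set name first, a description second, and the member genes after that. The reader below is a minimal sketch; the set names and genes in the example are hypothetical.

```python
import io

def read_gmt(handle):
    """Parse GMT text into a dict: gene set name -> set of member genes."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines (name, description, >= 1 gene expected).
            continue
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets

# Hypothetical two-line GMT content; a real file would be opened with open(path).
example = io.StringIO(
    "TOY_PATHWAY_1\thypothetical description\tGENE_A\tGENE_B\tGENE_C\n"
    "TOY_PATHWAY_2\tna\tGENE_C\tGENE_D\n"
)
sets = read_gmt(example)
for name, genes in sets.items():
    print(name, sorted(genes))
```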
Sample Phenotype File • The GSEA algorithm works with both categorical labels and continuous labels: • A categorical label defines a discrete phenotype (for example, ALL, MLL, and AML). • The GSEA algorithm analyzes two labels at a time (for example, ALL versus MLL or ALL versus not_ALL). • A continuous label is used, for example, to analyze a time series experiment in which five samples are taken at 30-minute intervals.
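For categorical labels, the phenotype file is usually a three-line CLS file. The sketch below writes that layout as I understand it from the GSEA documentation (sample count, class count and `1` on the first line; `#` followed by class names on the second; one label per sample on the third), so check it against the current format description before relying on it. The helper names are hypothetical.

```python
def format_cls(labels):
    """Build categorical CLS text from a list of per-sample class labels."""
    # Class names in order of first appearance.
    classes = list(dict.fromkeys(labels))
    header = f"{len(labels)} {len(classes)} 1"
    class_line = "# " + " ".join(classes)
    label_line = " ".join(labels)
    return "\n".join([header, class_line, label_line]) + "\n"

def one_vs_rest(labels, target):
    """Collapse several classes into 'target' versus 'not_target'."""
    return [lab if lab == target else f"not_{target}" for lab in labels]

# Hypothetical sample labels for six samples.
samples = ["ALL", "ALL", "MLL", "MLL", "AML", "AML"]
print(format_cls(samples))
print(format_cls(one_vs_rest(samples, "ALL")))
```

The second call illustrates the "ALL versus not_ALL" comparison, where the remaining classes are merged into a single label.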
Gene sets • The Molecular Signatures Database (MSigDB) gene sets are divided into 8 major collections (http://software.broadinstitute.org/gsea/msigdb): • C1: positional gene sets • C2: curated gene sets • C3: motif gene sets • C4: computational gene sets • C5: GO gene sets • C6: oncogenic signatures • C7: immunologic signatures • H: hallmark gene sets
GSEA Results Overview [Figure: GSEA enrichment plots; panels labeled “Enrichment at bottom of the list” and “Enrichment at top of the list”]
Leading Edge Genes • Leading-edge subset of a gene set = the members of the set that appear in the ranked list at or before the point where the running sum reaches its maximum deviation from zero (for a positive ES). • For a negative ES, it is the set of members that appear after the peak score. • Leading-edge analysis = examining the genes that are in the leading-edge subsets of the enriched gene sets.
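The running sum and the leading-edge subset can be sketched as follows. This is a simplified illustration of a weighted running-sum statistic (hits step up in proportion to `|score|**p`, misses step down by a constant), not the GSEA software itself; the gene names, scores, and the default `p=1.0` are assumptions for the example.

```python
def running_sum(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum profile over a ranked gene list for one gene set."""
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    total, profile = 0.0, []
    for g, h in zip(ranked_genes, hits):
        total += abs(scores[g]) ** p / hit_total if h else -1.0 / n_miss
        profile.append(total)
    return profile

def es_and_leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, leading-edge genes): ES is the maximum deviation from zero."""
    profile = running_sum(ranked_genes, scores, gene_set, p)
    peak = max(range(len(profile)), key=lambda i: abs(profile[i]))
    es = profile[peak]
    # Positive ES: members at or before the peak; negative ES: members after it.
    window = ranked_genes[: peak + 1] if es >= 0 else ranked_genes[peak + 1:]
    return es, [g for g in window if g in gene_set]

# Hypothetical ranked list (most up-regulated first) with gene-level scores.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
scores = {"G1": 3.0, "G2": 2.0, "G3": 1.0, "G4": -1.0, "G5": -2.0, "G6": -3.0}
print(es_and_leading_edge(ranked, scores, {"G1", "G2"}))
print(es_and_leading_edge(ranked, scores, {"G5", "G6"}))
```

The first set is concentrated at the top of the list (positive ES), the second at the bottom (negative ES).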
GSEA Statistic • Enrichment score (ES): reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. • A positive ES indicates gene set enrichment at the top of the ranked list. • A negative ES indicates gene set enrichment at the bottom of the ranked list. • Normalized enrichment score (NES): accounts for differences in gene set size and in correlations between gene sets and the expression dataset. • It can be used to compare analysis results across gene sets. • False discovery rate (FDR): the estimated probability that a gene set with a given NES represents a false positive finding. • For example, an FDR of 25% indicates that the result is likely to be valid 3 out of 4 times. • Nominal p-value: estimates the statistical significance of the enrichment score for a single gene set. • When you are evaluating multiple gene sets, you must correct for gene set size and multiple hypothesis testing.
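As a rough sketch of how the NES and the nominal p-value relate to the ES, suppose a null distribution of ES values has already been obtained by permutation. One common scheme, assumed here, rescales the observed ES by the mean of the null values with the same sign and takes the p-value as the fraction of those same-sign null values that are at least as extreme. The function name and the numbers are illustrative only.

```python
from statistics import mean

def nes_and_nominal_p(observed_es, null_es):
    """Normalize an ES and estimate a nominal p-value from permutation ES values."""
    # Use only the part of the null distribution with the same sign as the ES.
    same_sign = [x for x in null_es if (x >= 0) == (observed_es >= 0)]
    if not same_sign:
        raise ValueError("no null ES values share the sign of the observed ES")
    scale = abs(mean(same_sign))
    if scale == 0:
        raise ValueError("same-sign null ES values average to zero")
    nes = observed_es / scale
    extreme = sum(1 for x in same_sign if abs(x) >= abs(observed_es))
    return nes, extreme / len(same_sign)

# Hypothetical observed ES and a tiny null distribution (real runs use many more).
nes, p = nes_and_nominal_p(0.6, [0.2, 0.4, -0.3, 0.6, -0.5])
print(f"NES = {nes:.2f}, nominal p = {p:.2f}")
```

Because the NES is on a common scale across gene sets, it is the quantity on which an FDR across all tested sets can then be estimated.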
Advantages of GSEA • Agnostic to the type of gene set and the source of annotation • Operates on any ordered gene list • Does not require the choice of a gene selection threshold or the explicit definition of a statistically significant marker set • Uses distribution-free, non-parametric, permutation-based test procedures with increased statistical power • Incorporates the permutation of phenotype labels, thereby preserving the “biological” correlation structure of the markers • Takes into account multiple hypothesis testing across multiple gene sets.
BiNGO • BiNGO: A Biological Network Gene Ontology tool http://www.psb.ugent.be/cbd/papers/BiNGO/ • Works with the Cytoscape network visualization tool • Also permits custom annotation. • Shows the relationships between annotation categories
Enrichr • The enrichment analysis tool http://amp.pharm.mssm.edu/Enrichr/ • Uses Clustergrammer to produce dynamic heatmaps with enriched terms as columns and user input genes as rows • Helps users understand the relationships between their input genes and the enriched terms.