Explore the concepts of gene sets and enrichment analysis, crucial in understanding complex disease processes involving sets of genes. Learn about statistical methods, annotation sources, and tools for enrichment analysis.
Genesets and Enrichment Lecture 14 BF528 Instructor: Kritika Karri kkarri@bu.edu
A long list of DE genes: what happens next? • Select some genes for validation • Do some follow-up experiments • Publish a huge table with results • Try to learn about genes from published literature
Introduction • Single-gene analysis has been instrumental in our understanding of cell-biological processes. • However, in a disease process, • it is usually not a single gene but a set of genes that is involved in the clinical manifestation of the disease. • It is therefore more relevant to study the changes initiated by a set of genes, which can dramatically alter various cell-biological and metabolic pathways. • Commonly used approaches to analyze a gene set • over-representation analysis • aggregate score calculation.
“Enrichment” and Geneset • Enrichment: “act of making fuller or meaningful” (dictionary.com) • Gene sets are predefined in the literature or in databases: • Groups of genes that share a similar function, pathway, cellular function, etc. • Gene enrichment: combining information across genes to make sense of gene lists. • A gene set is enriched if the experimental findings are in accordance with the set of interest.
Gene Set Enrichment • Gene set enrichment is an approach to finding sets of biologically connected genes that are enriched for differential expression. • Gene set enrichment analysis (GSEA) • Statistical analysis that calculates the significance of gene set enrichment by comparing the gene set's distribution to a “background distribution”
Why do enrichment analysis? • Most array, sequencing, and screening experiments produce • A measurement for most or all genes • List(s) of “interesting” genes • Most cellular processes involve sets of genes. • Can we compare the above two datasets? • Is the overlap different than expected? • Does this tell us something about cellular mechanisms? • Too many genes to examine in detail. • Are we biased? • How do we know that what we’re seeing is surprising?
Main Types of Enrichment Analysis • List‐based: inputs are • A subset of all genes chosen by some relevant method • A list of annotations, each linked to genes • Rank‐based: inputs are • A set of all genes ranked by some metric (ratio, fold change, etc.) • A list of annotations, each linked to genes • List‐based with relationships: inputs are • A subset of all genes • A list of annotations, each linked to genes, organized in some relationship (e.g., a hierarchy)
Getting your list • Goal: Identify a list of genes (or probes) that appear to be working together in some way. • What identifiers to use? • Most common method: Get a list of differentially expressed genes • P‐value and/or fold change? • Threshold? • Alternatives: • Define a cluster • Sort data and/or apply a model to rank genes • Recommendations: • Try lists of varying length • Try to maximize signal / noise (What produces the smallest p‐values for enrichment?)
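As a minimal sketch of the "P-value and/or fold change" thresholding step, the snippet below filters a toy results table and shows how the list length changes with the cutoff. The record type, the function name `select_de_genes`, the gene labels, and the default thresholds are all illustrative assumptions, not values from the lecture.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """One row of a hypothetical differential expression table."""
    gene: str
    log2_fold_change: float
    adjusted_p: float


def select_de_genes(results, max_adjusted_p=0.05, min_abs_log2_fc=1.0):
    """Return the genes that pass both the p-value and fold-change cutoffs."""
    return [
        r.gene
        for r in results
        if r.adjusted_p <= max_adjusted_p
        and abs(r.log2_fold_change) >= min_abs_log2_fc
    ]


# Toy data with made-up gene labels and values.
results = [
    GeneResult("GENE_A", 2.3, 0.001),
    GeneResult("GENE_B", -1.4, 0.020),
    GeneResult("GENE_C", 0.4, 0.003),
    GeneResult("GENE_D", 3.1, 0.200),
    GeneResult("GENE_E", -0.9, 0.040),
]

strict_list = select_de_genes(results)                        # default cutoffs
relaxed_list = select_de_genes(results, min_abs_log2_fc=0.5)  # longer list
print(strict_list, relaxed_list)
```

Running the enrichment on both `strict_list` and `relaxed_list` is one way to follow the recommendation to try lists of varying length.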
Annotation Sources • Gene Ontology (most popular) • KEGG; REACTOME pathways • Genes sharing a motif or regulated by the same protein/miRNA • Genes found on the same chromosome • Broad’s Molecular Signatures Database (MSigDB) • Any grouping that is biologically sensible • (Will discuss in detail.)
Statistics to test for enrichment • Tests for enrichment • Fisher’s exact • Hypergeometric • Binomial • Chi‐squared • Kolmogorov‐Smirnov • Permutation
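To make the hypergeometric test concrete, here is a minimal sketch of a list-based over-representation p-value using only the standard library. The function name and argument names are assumptions chosen for illustration. The upper-tail probability computed here is the same quantity as a one-sided Fisher's exact test on the corresponding 2x2 table.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_in_list, n_overlap):
    """P(overlap >= n_overlap) when n_in_list genes are drawn at random
    from n_background genes, of which n_in_set carry the annotation."""
    total = comb(n_background, n_in_list)
    max_overlap = min(n_in_set, n_in_list)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_in_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 3 annotated, 4 in the list, 2 overlapping.
p_value = hypergeom_enrichment_p(10, 3, 4, 2)
print(round(p_value, 4))
```

The choice of `n_background` matters: restricting it to the genes actually assayed, rather than the whole genome, usually gives larger (more conservative) p-values.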
Statistical Considerations • What is the probability of observing enrichment at least this extreme by chance alone? • Different tests produce very different ranges of p-values • All look for over‐enrichment; some also look for under-enrichment • Recommendation: • Use p‐values as a tool to rank genes but don’t take them literally • Most methods correct for multiple testing (e.g., with FDR), which is necessary
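One common FDR correction is the Benjamini-Hochberg procedure. The sketch below is an assumption about which correction a tool might apply (individual tools differ), and the function name is illustrative. It converts one raw p-value per tested annotation into an adjusted p-value.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # toy p-values, one per annotation term
print(benjamini_hochberg(raw))
```

With many annotation terms tested at once, the adjusted values are the ones to threshold; the raw values are still useful for ranking.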
Things to consider when doing an enrichment analysis • Choose a tool that • Includes your species • Includes your gene / probe identifiers • Has up‐to‐date annotation • Lets you define your background (if possible) • Get recommendations from the usual sources. • Try at least a few tools. • Try lists of varying length. • Some recommended tools • DAVID • GSEA • BIOBASE (Whitehead has license) • BiNGO (uses Cytoscape) • GoMiner: http://discover.nci.nih.gov/gominer • GOstat: http://gostat.wehi.edu.au
Structure of GO • A way to capture biological knowledge for individual gene products in a written and computable form • A set of concepts and their relationships to each other arranged as a hierarchy. • Descendant terms are related to parents by either “is a” or “part of” relationships. • For example, the nucleus is part of a cell, whereas a neuron is a cell.
Need some statistical significance • The majority of tools are based on the idea of identifying GO categories that are significantly enriched in a list of differentially expressed genes. • This requires some threshold to define genes as ‘significant’. • GSEA takes a different approach by considering all assayed genes.
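The hierarchy can be pictured as a mapping from each term to its parents, with the relationship type attached. The sketch below uses a toy fragment built from the slide's own example; the term labels are plain strings rather than real GO identifiers, and the `PARENTS` table and `ancestors` function are hypothetical names.

```python
# Toy ontology fragment: child term -> list of (relationship, parent term).
# Labels are illustrative strings, not real GO identifiers.
PARENTS = {
    "neuron": [("is_a", "cell")],
    "nucleus": [("part_of", "cell")],
    "nuclear membrane": [("part_of", "nucleus")],
    "cell": [],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following is_a / part_of links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(ancestors("nuclear membrane"))
```

Enrichment tools typically treat a gene annotated to a term as also annotated to all of that term's ancestors, which is why broad parent terms collect many genes.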
DAVID • Database for Annotation, Visualization and Integrated Discovery (NIAID) http://david.abcc.ncifcrf.gov/ • List‐based; Lots of identifiers; lots of species • Allows background definition • Statistic is a modified Fisher exact test
Overrepresentation vs Aggregate score • Over-representation analysis relies on the cutoff used to generate the gene list, and its results can vary considerably depending on that list. • It can yield a long list of significant genes without any unifying biological theme. • The cutoff value is often arbitrary! • We are really examining only a handful of genes, ignoring much of the data. • The aggregate score approach computes a score for each gene set from the gene-specific scores of its members, which overcomes these limitations.
Gene Set Enrichment Analysis (GSEA) • Detecting modest changes in gene expression datasets is hard, due to: • the large number of variables, • the high variability between samples, and • the limited number of samples. • The goal of GSEA is to detect modest but coordinated changes in prespecified sets of related genes. • Such a set might include all the genes in a specific pathway, for instance.
GSEA Input Files • Gene expression dataset • [or alternatively, a ranked list of genes] • Phenotype labels • Discrete phenotypes – two or more • Continuous phenotypes, e.g. time series • Gene sets • Select an MSigDB gene set collection • Or supply a gene set file http://www.broadinstitute.org/gsea/
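A user-supplied gene set file is commonly given in the tab-delimited GMT layout: one gene set per line, with the set name, a description, and then the member genes. The sketch below parses that layout from a string; the function name and the example set and gene labels are made up for illustration.

```python
def parse_gmt(text):
    """Parse GMT-style text into {gene_set_name: set_of_genes}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Two hypothetical gene sets in GMT layout.
example_gmt = (
    "SET_ONE\tdescription one\tGENE_A\tGENE_B\tGENE_C\n"
    "SET_TWO\tdescription two\tGENE_B\tGENE_D\n"
)
gene_sets = parse_gmt(example_gmt)
print(gene_sets)
```

The gene identifiers in the file need to match those used in the expression dataset or ranked list, otherwise the sets will appear empty.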
Sample Phenotype File • The GSEA algorithm works with both categorical labels and continuous labels: • A categorical label defines a discrete phenotype (for example, ALL, MLL, and AML). • The GSEA algorithm analyzes two labels at a time (for example, ALL versus MLL or ALL versus not_ALL). • A continuous label defines a continuous phenotype: • used, for example, to analyze a time series experiment • for example, five samples taken at 30-minute intervals.
Geneset • The Molecular Signatures Database (MSigDB) gene sets are divided into 8 major collections: http://software.broadinstitute.org/gsea/msigdb • H: hallmark gene sets • C1: positional gene sets • C2: curated gene sets • C3: motif gene sets • C4: computational gene sets • C5: GO gene sets • C6: oncogenic signatures • C7: immunologic signatures
GSEA Results Overview • [Figure: enrichment plots; panel labels: “Enrichment at top of the list” and “Enrichment at bottom of the list”]
Leading Edge Genes • Leading edge subset of a gene set = the members that contribute most to the enrichment score. • For a positive ES, it is the set of members that appear in the ranked list at or before the point where the running sum reaches its maximum. • For a negative ES, it is the set of members that appear subsequent to the peak score. • Leading edge analysis = examine the genes that are in the leading edge subsets of the enriched gene sets.
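The running sum and leading edge can be sketched in a few lines. This is a simplified, unweighted version (every hit counts equally), which is an assumption made for brevity; GSEA by default weights hits by the ranking metric. Function names and gene labels are illustrative.

```python
def running_sum(ranked_genes, gene_set):
    """Unweighted running sum: step up at set members, down at non-members."""
    members = [g in gene_set for g in ranked_genes]
    n_hits = sum(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    total, profile = 0.0, []
    for is_member in members:
        total += 1.0 / n_hits if is_member else -1.0 / n_miss
        profile.append(total)
    return profile


def enrichment_score_and_leading_edge(ranked_genes, gene_set):
    """Return (ES, leading edge genes) for one gene set."""
    profile = running_sum(ranked_genes, gene_set)
    # Peak = position of the maximum deviation from zero.
    peak = max(range(len(profile)), key=lambda i: abs(profile[i]))
    es = profile[peak]
    # Positive ES: members up to the peak; negative ES: members after it.
    window = ranked_genes[: peak + 1] if es >= 0 else ranked_genes[peak + 1:]
    return es, [g for g in window if g in gene_set]


ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]  # most up-regulated first
top_es, top_edge = enrichment_score_and_leading_edge(ranked, {"g1", "g2", "g7"})
bottom_es, bottom_edge = enrichment_score_and_leading_edge(ranked, {"g2", "g7", "g8"})
print(top_es, top_edge, bottom_es, bottom_edge)
```

In the first call the set is concentrated at the top of the list, so the ES is positive and the leading edge is the members before the peak; in the second call the ES is negative and the leading edge is the members after it.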
GSEA Statistic • Enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. • positive ES indicates gene set enrichment at the top of the ranked list. • negative ES indicates gene set enrichment at the bottom of the ranked list. • Normalised Enrichment score (NES): accounts for differences in gene set size and in correlations between gene sets and the expression dataset. • can be used to compare analysis results across gene sets • false discovery rate (FDR) is the estimated probability that a gene set with a given NES represents a false positive finding. • For example, an FDR of 25% indicates that the result is likely to be valid 3 out of 4 times. • The nominal p value estimates the statistical significance of the enrichment score for a single gene set. • When you are evaluating multiple gene sets, you must correct for gene set size and multiple hypothesis testing.
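A rough sketch of how a nominal p-value and an NES can be derived from a null distribution of enrichment scores. Two simplifications are assumed here: the ES is unweighted, and the null is built by drawing random gene sets of the same size (as in a pre-ranked analysis) rather than by permuting phenotype labels. All function names and data are illustrative.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted ES: maximum deviation from zero of the running sum."""
    n_hits = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hits
    total, best = 0.0, 0.0
    for g in ranked_genes:
        total += 1.0 / n_hits if g in gene_set else -1.0 / n_miss
        if abs(total) > abs(best):
            best = total
    return best


def permutation_test(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Return (ES, NES, nominal p) using random gene sets of matching size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    null = [
        enrichment_score(ranked_genes, set(rng.sample(ranked_genes, size)))
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [s for s in null if (s >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan"), float("nan")
    p_value = sum(abs(s) >= abs(observed) for s in same_sign) / len(same_sign)
    nes = observed / (sum(abs(s) for s in same_sign) / len(same_sign))
    return observed, nes, p_value


ranked = [f"g{i}" for i in range(1, 101)]  # toy ranked list of 100 genes
es, nes, p = permutation_test(ranked, set(ranked[:10]))
print(es, nes, p)
```

Dividing by the mean of the same-sign null scores is what makes NES values comparable across gene sets of different sizes. A reported p of 0 only means that no permutation reached the observed score, i.e. p is below 1/`n_perm`.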
Advantages of GSEA • Agnostic to the type of gene set and the source of annotation • Operates on any ordered gene list • Does not require the choice of a gene selection threshold or the explicit definition of a statistically significant marker set • Uses distribution-free, non-parametric, permutation-based test procedures with increased statistical power • Incorporates the permutation of phenotype labels, thereby preserving the “biological” correlation structure of the markers • Takes into account multiple hypothesis testing across gene sets.
BiNGO • BiNGO: A Biological Network Gene Ontology tool http://www.psb.ugent.be/cbd/papers/BiNGO/ • Works with the Cytoscape network visualization tool • Also permits custom annotation. • Shows relationships between annotation categories
Enrichr • The enrichment analysis tool http://amp.pharm.mssm.edu/Enrichr/ • Uses Clustergrammer to produce dynamic heatmaps with enriched terms as columns and user input genes as rows • Helps users understand the relationships between their input genes and enriched terms.