Gene expression studies of cancer: gene transcription signatures. Chad Creighton, February 2009.
Oncogenic signaling pathways in cancer • Mutation/deregulation of a handful of genes can make cells into cancer cells. Hanahan and Weinberg. Cell. 2000 100:57-70.
Widespread deregulation of gene expression in cancer • Gene expression profiling distinguishes prostate cancer from normal prostate and from BPH. Dhanasekaran et al. Nature. 2001 Aug 23;412(6849):822-6.
Widespread deregulation of gene expression in cancer • Gene expression profiling identifies different subtypes of breast cancer. Sorlie et al. PNAS. 2003 100(14):8418-23
A gene-expression signature as a predictor of survival in breast cancer www.agendia.com Van de Vijver et al. NEJM 2002 347(25):1999-2009.
Oncogenic pathway signatures in human cancers as a guide to targeted therapies • Use oncogenic signatures to predict response of cell lines to targeted therapy. Bild et al. Nature. 2006 439(7074):353-7.
Oncogenic signatures of ERBB2, EGFR, MEK, RAF, and MAPK in breast cancer cells Creighton et al. Cancer Res. 2006 66(7):3903-11.
Preliminary gene expression profiling studies of cancer • Hundreds of genes are deregulated in cancer. • Different subtypes of cancer are defined by gene expression profiling. • Gene expression signatures may predict cancer patient survival. • Gene expression signatures of oncogenic signaling pathways can be defined using experimental models (cell lines, mice).
Potential uses for gene expression profiling of cancer • Define and understand the molecular pathways that underlie cancer. • Define subgroups of patients for the purposes of optimizing treatment. • Determine whether or not a patient would benefit from a given therapy (e.g. chemotherapy). • Determine what specific pathways are deregulated in the tumor and treat the tumor with therapies that target that pathway (e.g. hormone therapy for ER+ breast cancer).
General concepts of gene expression analysis • Low level analysis • Processing image files • Normalization • Quality Control (QC) • High level analysis • Clustering • Selecting differentially expressed genes • Enrichment analysis or “Meta-analysis”
Publicly available gene expression profile data represents a rich resource • When publishing studies using gene expression profile data, authors are encouraged to make the data available to everyone. • Subsequent studies can re-analyze the data with different questions in mind from what the original authors had.
GEO database (http://www.ncbi.nlm.nih.gov/geo/) makes thousands of expression profile datasets publicly available. • Many top journals require microarray studies to make data public on GEO.
Pathway-related gene sets: Gene Ontology (GO) terms • The Gene Ontology project provides a controlled vocabulary to describe gene attributes. • Three major categories: • Cellular component • Biological process • Molecular function • The controlled vocabularies are structured so that they can be queried at different levels: • For example, use GO to find all gene products involved in ‘signal transduction’, or zoom in on all ‘receptor tyrosine kinases’. www.geneontology.org
Pathway-related gene sets: Molecular Signatures Database (MSigDB) • From the Broad Institute • Collection of gene sets curated from the literature (including gene expression profiling studies). • Current version represents over 1800 pathway-associated gene sets. http://www.broad.mit.edu/gsea/msigdb/index.jsp
Gene “signatures” • Will be loosely defined here to mean a set of genes that are functionally associated with each other in some way. • Ways to define gene signatures: • Gene annotation (e.g. Gene Ontology terms) • Curated pathway-associated gene sets • Literature review articles • “Gene expression signature”: a gene signature defined using expression profiling data (e.g. what genes go up or down in response to treatment in an experimental model)
Gene expression signatures • When using expression profiling to define genes, a gene expression signature consists of two things: • A set of genes going “up” (relative to something). • A set of genes going “down” (relative to something). • Relative direction of the genes (up-regulated vs down-regulated, or over-expressed vs under-expressed) is important. • Keep the “up” genes separated from the “down” genes.
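One way to keep the “up” genes separated from the “down” genes is to store them as two sets and compare signatures direction by direction. The sketch below is only an illustration: the class name `ExpressionSignature`, the two helper functions, and the placeholder gene names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionSignature:
    """Hypothetical container: the 'up' and 'down' gene sets are never merged."""
    up: frozenset
    down: frozenset


def concordant_overlap(a, b):
    """Genes that change in the same direction in both signatures."""
    return (a.up & b.up) | (a.down & b.down)


def discordant_overlap(a, b):
    """Genes that change in opposite directions in the two signatures."""
    return (a.up & b.down) | (a.down & b.up)


# Placeholder gene names, for illustration only.
sig_a = ExpressionSignature(up=frozenset({"GENE1", "GENE2", "GENE3"}),
                            down=frozenset({"GENE4", "GENE5"}))
sig_b = ExpressionSignature(up=frozenset({"GENE2", "GENE3", "GENE5"}),
                            down=frozenset({"GENE4", "GENE6"}))

concordant = concordant_overlap(sig_a, sig_b)   # GENE2, GENE3, GENE4
discordant = discordant_overlap(sig_a, sig_b)   # GENE5
print(sorted(concordant), sorted(discordant))
```

Pooling the two directions into one set would count GENE5 as shared even though it goes up in one signature and down in the other.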
How do we relate gene expression profile results from different datasets to each other?
The enrichment problem • A: Given a gene set or sets of interest. • i.e. a “gene signature” • B: Given an independent expression dataset with the profiled genes being ranked by a specified metric. • e.g. “cancer vs. normal” or “correlation with MYC.” • Are the genes in (A) enriched within (B)? • i.e. do the results of (A) and (B) overlap significantly?
Methods for determining enrichment • Venn diagram, or “marble jar” approach • Take the top set of genes from the expression dataset (dataset B), tabulate the amount of overlap with the independent gene set of interest (dataset A). • Rank-based approach • Use the entire dataset, including genes of borderline significance or showing a weak trend towards significance. • Correlation approach • For a set of genes, compute correlation between two sets of weighting factors (based on different profiling datasets).
Venn diagram enrichment analysis • Requires us to make a “cut” to define what the top genes are. • Significance of overlap may be determined by chi-square or one-sided Fisher’s exact tests.
Venn diagram enrichment analysis, step by step • Define gene set of interest. • Define differentially expressed genes. • Determine overlap between the two gene sets.
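The three steps (gene set of interest, a “cut” on the expression dataset, overlap) can be sketched with plain sets. The scores, gene names, and the cut of three genes below are made up for illustration; in practice the cut would come from a significance or fold-change threshold chosen by the analyst.

```python
def top_genes(scores, n_top):
    """Apply the 'cut': return the n_top genes with the highest score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_top])


# Made-up per-gene scores (e.g. t-statistics for cancer vs. normal).
scores = {"G1": 5.2, "G2": 4.1, "G3": 3.3, "G4": 0.4, "G5": -0.2, "G6": -1.9}

gene_set = {"G1", "G3", "G6"}          # step 1: gene set of interest
selected = top_genes(scores, n_top=3)  # step 2: differentially expressed genes
overlap = selected & gene_set          # step 3: overlap between the two sets

print(sorted(selected), sorted(overlap))
```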
Hypergeometric formula (one-sided Fisher’s exact test) • Number of genes in total population: G • Genes in G falling under pre-defined class: A • Number of genes selected: k • Number of selected genes k in class A: n • The number of genes expected to overlap by chance: (k × A)/G • One-sided Fisher’s exact test determines whether n is significantly greater than (k × A)/G.
Hypergeometric formula (one-sided Fisher’s exact test) • With G the number of genes in the total population, A the number of those genes in the pre-defined class, k the number of genes selected, and n the number of selected genes in the class, the probability P of the class occurring n or more times within a set of k genes randomly selected from the population is: $$P = \sum_{i=n}^{\min(k,A)} \frac{\binom{A}{i}\binom{G-A}{k-i}}{\binom{G}{k}}$$
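The upper-tail hypergeometric probability can be computed directly with exact integer binomial coefficients. This is a minimal sketch using only the Python standard library; the function name `hypergeom_tail` is an illustrative choice, and the arguments follow the slide's symbols G, A, k, and n.

```python
from math import comb


def hypergeom_tail(G, A, k, n):
    """P(overlap >= n) when k genes are drawn without replacement
    from a population of G genes, A of which belong to the class."""
    upper = min(k, A)
    favourable = sum(comb(A, i) * comb(G - A, k - i) for i in range(n, upper + 1))
    return favourable / comb(G, k)


# Toy check: 10 genes, 4 in the class, 3 selected, at least 2 of them in the class.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = (36 + 4) / 120 = 1/3
p_toy = hypergeom_tail(G=10, A=4, k=3, n=2)
print(p_toy)
```

Working with exact integers until the final division avoids the underflow that a naive product of floating-point factorials would hit on array-sized populations.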
What is the total gene population (G)? • Can represent the number of genes profiled on the array chip. • What if two different array platforms were used (a different set of genes is typically represented on each)? • Either use the common set of genes represented on both array chips as the total population (do not consider genes not represented on both arrays), • or use ONE of the two array platforms to define the gene population (do not consider genes on the other array platform that are not represented on the first platform).
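The first option (using the genes common to both platforms) amounts to intersecting the platforms and trimming each signature to that intersection before counting any overlap. The sketch below assumes gene identifiers have already been mapped to a shared naming scheme; the function name and the tiny gene lists are illustrative only.

```python
def common_population(platform_a, platform_b, signature_a, signature_b):
    """Restrict two signatures to the genes represented on both platforms."""
    population = set(platform_a) & set(platform_b)
    return population, set(signature_a) & population, set(signature_b) & population


# Illustrative platforms and signatures (not real array content).
platform_a = {"EGFR", "ERBB2", "MYC", "TP53", "KRAS"}
platform_b = {"EGFR", "MYC", "TP53", "CCND1"}
signature_a = {"EGFR", "ERBB2", "MYC"}
signature_b = {"EGFR", "CCND1", "TP53"}

population, sig_a, sig_b = common_population(platform_a, platform_b,
                                             signature_a, signature_b)
G = len(population)   # total gene population for the enrichment test
A = len(sig_a)        # genes in the pre-defined class
k = len(sig_b)        # genes selected
n = len(sig_a & sig_b)  # overlap
print(G, A, k, n)
```

ERBB2 and CCND1 drop out of their signatures because each is missing from one platform, so neither can contribute to or dilute the overlap.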
A gene signature of mutation of EGFR in NSCLC cell lines • Compared lung cancer cell lines with or without an activating mutation in EGFR. • Wanted to compare this gene signature with another gene signature of EGFR. (Figure: lung cancer cell lines.) Choi, Creighton, et al., PLoS ONE 2(11): e1226.
Oncogenic signatures of ERBB2, EGFR, MEK, RAF, and MAPK in breast cancer cells • Does the published MCF-7+EGFR signature overlap with the NSCLC EGFR signature? Creighton et al. Cancer Res. 2006 66(7):3903-11.
Compare NSCLC EGFR mutant signature with a signature of EGFR-transfected MCF-7 cells • EGFR wt NSCLC genes: 119 • MCF7 EGFR genes: 1152 • Genes shared between MCF7/NSCLC array platforms: 11079 • Genes shared between MCF7/NSCLC gene signatures: 44 • Significance of overlap: p<1E-10 (one-sided Fisher’s exact test). Choi, Creighton, et al., PLoS ONE 2(11): e1226.
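Plugging the counts from this slide into the hypergeometric tail gives a rough check of the reported significance. The assignment of the MCF7 list to A and the NSCLC list to k is an arbitrary choice made here (the test is symmetric in the two), and the 2x2 table layout is only one conventional way to arrange the counts.

```python
from math import comb

# Counts taken from the slide.
G = 11079   # genes shared between the two array platforms
A = 1152    # MCF7 EGFR genes
k = 119     # NSCLC genes
n = 44      # genes shared between the two signatures

expected = k * A / G      # overlap expected by chance, about 12.4 genes
fold = n / expected       # observed / expected, about 3.6

# 2x2 contingency table: rows = in NSCLC list or not, columns = in MCF7 list or not.
table = [[n, k - n],
         [A - n, G - A - k + n]]

# One-sided Fisher's exact test = upper tail of the hypergeometric distribution.
p_value = sum(comb(A, i) * comb(G - A, k - i)
              for i in range(n, min(k, A) + 1)) / comb(G, k)

print(f"expected={expected:.1f} fold={fold:.1f} p={p_value:.2e}")
```

The observed overlap of 44 is several times the chance expectation, which is consistent with the slide's p<1E-10.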
A gene signature of mutation of EGFR in NSCLC cell lines is enriched with EGFR-dependent genes. Choi, Creighton, et al., PLoS ONE 2(11): e1226.
Experimental models versus clinical tumors • Molecular data from experimental models represent dynamic information, but clinical relevance is not always clear (e.g. could represent experimental artifacts). • Data from clinical tumor specimens represent more static information, where the associations observed may be pathologically relevant.
Experimental models versus clinical tumors • From clinical data alone, cause-and-effect associations cannot be distinguished from correlation. • In cancer studies, it is important to combine the experimental with the clinical. • Some researchers may doubt the validity of experimental results unless they can be shown to apply to human tissues.
Rank-based enrichment analysis • Rank-based approaches use all of the genes from one of the datasets to determine enrichment (they do not make a “cut”). (Figure: locations of genes from set B within the rank-ordered genes from dataset A.)
GSEA (rank-based) enrichment analysis • All the genes in the dataset are used here. • Start from the top of the ranked list. • Add points to the “random walk” for each gene you find in S. • Remove points from the “random walk” for each gene not in S. Subramanian, Aravind et al. (2005) Proc. Natl. Acad. Sci. USA 102, 15545-15550
GSEA Kolmogorov-Smirnov statistic • Consider the genes R_1, ..., R_N that are ordered on the basis of the difference metric between the two classes and a gene set S containing G members. • We define $X_i = -\sqrt{G/(N-G)}$ if R_i is not a member of S, or $X_i = \sqrt{(N-G)/G}$ if R_i is a member of S. • We then compute a running sum across all N genes. • The ES is defined as $\max_{1 \le j \le N} \sum_{i=1}^{j} X_i$, or the maximum observed positive deviation of the running sum.
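The running sum and its peak can be sketched in a few lines. This assumes the unweighted increments, +sqrt((N-G)/G) for a gene in S and -sqrt(G/(N-G)) for a gene not in S, which are scaled so that the walk returns to zero after the last gene; the weighted variant in the GSEA software is not reproduced here. The function name and toy gene lists are illustrative.

```python
from math import sqrt


def enrichment_score(ranked_genes, gene_set):
    """Return (ES, walk) for the unweighted Kolmogorov-Smirnov-style running sum."""
    N = len(ranked_genes)
    G = sum(1 for gene in ranked_genes if gene in gene_set)
    if G == 0 or G == N:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit = sqrt((N - G) / G)      # added for a gene in S
    miss = -sqrt(G / (N - G))    # added (negative) for a gene not in S
    running = 0.0
    walk = []
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        walk.append(running)
    return max(walk), walk


ranked = ["a", "b", "c", "d", "e", "f"]              # most to least differentially expressed
es_top, walk_top = enrichment_score(ranked, {"a", "b"})        # set at the top of the list
es_bottom, walk_bottom = enrichment_score(ranked, {"e", "f"})  # set at the bottom of the list
print(round(es_top, 3), round(es_bottom, 3))
```

A set concentrated at the top of the list produces a high peak early in the walk, while a set at the bottom never rises above zero.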
GSEA Kolmogorov-Smirnov statistic • The ES score (the “peak” of the Random walk) is just a number. • Need to evaluate the significance of the number by some type of permutation testing: • Permute the sample labels many times, OR • Permute the gene sets (i.e. randomly generate gene sets). • In either case, compare distribution of scores from random tests with the actual score.
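The first option, permuting the sample labels, can be sketched as follows. To keep the example short, the set-level score is a simple stand-in (the summed difference in class means over the genes in the set), not the GSEA enrichment score; the expression values, function names, and number of permutations are all illustrative assumptions.

```python
import random
from statistics import mean


def set_score(expr, labels, gene_set):
    """Stand-in score: sum over genes in the set of (mean in class 1 - mean in class 0)."""
    total = 0.0
    for gene in sorted(gene_set):
        values = expr[gene]
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        total += mean(ones) - mean(zeros)
    return total


def label_permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores obtained after shuffling sample labels."""
    rng = random.Random(seed)
    observed = set_score(expr, labels, gene_set)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_score(expr, shuffled, gene_set) >= observed:
            at_least += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)


# Made-up expression values: three tumor samples (label 1), three normal samples (label 0).
expr = {
    "G1": [5.1, 4.8, 5.3, 1.0, 1.2, 0.9],
    "G2": [3.9, 4.2, 4.0, 2.1, 1.8, 2.0],
    "G3": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8],
}
labels = [1, 1, 1, 0, 0, 0]
observed, p_label = label_permutation_p(expr, labels, {"G1", "G2"})
print(round(observed, 2), p_label)
```

With only six samples there are just 20 distinct labelings, so the smallest attainable p-value is about 0.05; this is one reason nominal significance is treated differently for label permutation than for overlap tests on large gene sets.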
GSEA (rank-based) enrichment analysis Subramanian, Aravind et al. (2005) Proc. Natl. Acad. Sci. USA 102, 15545-15550 Examples of GSEA running enrichment scores
GSEA (rank-based) enrichment analysis Subramanian, Aravind et al. (2005) Proc. Natl. Acad. Sci. USA 102, 15545-15550 Sets with genes not located at the top of the ranked gene population may still yield significant enrichment scores.
A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer Lamb, et al. Cell 114:323-34, 2003
The Connectivity Map of gene signatures induced by 164 different small molecule inhibitors Lamb et al., Science. 2006 313(5795):1929-35
The Connectivity Map (Scoring derived from GSEA statistic)
Q1-Q2 analysis (another rank-based approach) • Q1: Compare enrichment pattern to that for randomly selected gene sets. • Q2: Compare enrichment pattern to that for randomly permuted labels in the reference profile dataset. Tian, et al. PNAS 102:13544-13549, 2005
A gene expression signature of Akt overexpression from a transgenic mouse model Majumder et al. Nat Med 10: 594–601, 2004
Venn diagram vs Rank-based methods • Venn diagram results are more easily interpretable. • For rank-based methods, genes that are not at all significant individually may contribute to enrichment. • What gene do you go after for validation? • With the Venn diagram, you have to make a cut. • May not include enough genes in the test.
Venn diagram vs Rank-based methods, what is a significant p-value? • If using the Venn diagram method in expression studies, p-value should be very low if working with sizable gene sets (e.g. <1E-6). • If using rank-based method, can consider a nominally significant p-value (e.g. p<0.05) to be good if permuting the sample labels is involved. • Can always try both ways in order to be certain of an enrichment association.
Rank-based: Q1-Q2 versus GSEA • Q1-Q2 enrichment score is much simpler • Take the sum of the t-statistic values for each gene in the set. • GSEA scoring is more complicated. • GSEA has user-friendly public software (http://www.broad.mit.edu/gsea/) • No software yet for Q1-Q2, have to write your own.
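Since the Q1-Q2 score is just a sum of t-statistics, a home-grown version is short. The sketch below covers only the score and a Q1-style comparison against randomly selected gene sets of the same size; it omits any normalization or Q2 label permutation, and the function names, t-statistics, and gene names are invented for illustration.

```python
import random


def sum_t_score(t_stats, genes):
    """Enrichment score: sum of per-gene t-statistics over profiled genes in the set."""
    return sum(t_stats[g] for g in genes if g in t_stats)


def q1_p_value(t_stats, gene_set, n_perm=2000, seed=0):
    """Q1-style test: compare the observed score with random gene sets of equal size."""
    rng = random.Random(seed)
    members = sorted(g for g in gene_set if g in t_stats)
    observed = sum_t_score(t_stats, members)
    population = sorted(t_stats)
    at_least = 0
    for _ in range(n_perm):
        random_set = rng.sample(population, len(members))
        if sum_t_score(t_stats, random_set) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Made-up t-statistics for 20 genes: G0 = 10, G1 = 9, ..., G19 = -9.
t_stats = {f"G{i}": 10 - i for i in range(20)}
gene_set = {"G0", "G1", "G2", "UNPROFILED"}   # one gene is absent from the dataset

q_score, q1_p = q1_p_value(t_stats, gene_set)
print(q_score, q1_p)
```

Genes in the set that were not profiled are ignored, so the random sets are matched to the number of genes actually scored.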
Correlation-based approach • Take the correlation between two sets of profiling results from different datasets. • May use all of the genes profiled or a specified subset (e.g. genes in a gene signature). • The correlation metric may be any one of a number of valid metrics (e.g. Pearson’s or Spearman’s rank).
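A correlation between two sets of profiling results can be computed over the genes they share, optionally restricted to a signature. The sketch below implements Pearson's and Spearman's coefficients with the standard library only; the profile values, gene names, and the helper `profile_correlation` are illustrative assumptions rather than a prescribed procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def profile_correlation(profile_a, profile_b, genes=None, method=pearson):
    """Correlate two {gene: weight} profiles over shared genes (optionally a subset)."""
    shared = sorted(set(profile_a) & set(profile_b))
    if genes is not None:
        shared = [g for g in shared if g in genes]
    return method([profile_a[g] for g in shared], [profile_b[g] for g in shared])


# Made-up weighting factors (e.g. log fold changes) from two datasets.
profile_a = {"G1": 2.0, "G2": 1.0, "G3": -1.0, "G4": -2.0, "G5": 0.5}
profile_b = {"G1": 4.0, "G2": 1.0, "G3": -0.5, "G4": -8.0, "G6": 3.0}

pearson_r = profile_correlation(profile_a, profile_b)
spearman_r = profile_correlation(profile_a, profile_b, method=spearman)
print(round(pearson_r, 3), round(spearman_r, 3))
```

In this toy case the rank order of the four shared genes is identical in both profiles, so Spearman's coefficient is 1 while Pearson's is pulled below 1 by the outlying value for G4.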