Explore the advantages of pathway analysis over single-gene methods in modeling cancer progression with examples and solutions. Learn about gene sets, their derivation, and importance in context. Discover how pathway analysis identifies genetic perturbations.
Mechanistic Models of Cancer Progression in the Space of Pathways Elena Edelman eje2@duke.edu Computational Biology and Bioinformatics Program Institute of Genome Policy and Science Duke University
Outline I. Biological Background • Problems with single gene analysis • Advantages of pathway analysis II. Gene Sets • How they are derived • Importance of understanding context III. Modeling Cancer Progression • Overview of multitask model • Prostate cancer example • Melanoma example Mechanistic Models of Cancer Progression, Elena Edelman presenting
Disadvantages of single gene based methods • Hundreds of differentially expressed genes • Subtle signals • Lack of consensus
Solutions • Hundreds of differentially expressed genes – group together in a small number of pathways • Subtle signals – brought to attention when seen as a group • Lack of consensus – consensus in processes/pathways, not single genes
Disadvantage of single gene methods • 13,023 genes → 1,149 mutated genes → 189 candidate cancer (CAN) genes • Each sample of a given tumor type had no more than six mutated CAN genes in common (Sjoblom 2006)
Importance of pathway analysis • Deregulation of specific processes is necessary for tumor formation. Each process has many potential member genes. • Alteration of any of a number of different genes can produce the same phenotypic result.
Rb pathway • Several cancer genes control transitions from resting state (G0 or G1) to replicating phase (S) of cell cycle. • Diverse protein products: • cdk4 (kinase), oncogene • cyclin D1 (activates cdk4), oncogene • Rb (transcription factor), TSG • p16 (inhibits cdk4), TSG
P53 TSG • P53 is a transcription factor that inhibits cell growth and stimulates cell death • Point mutation inactivates its capacity to bind specifically to its recognition sequence. • Other ways to achieve the same effect • Amplification of MDM2 • Infection with DNA tumor viruses whose products bind to p53 and functionally inactivate it.
Pathway Analysis • Identify gene sets whose expression patterns characterize specific genetic or molecular perturbations. • Early pathway analysis: Apply methods such as t-tests to determine differentially expressed genes between two classes. Use a database such as Gene Ontology to relate individual genes in terms of general cellular function.
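The per-gene testing step of early pathway analysis can be sketched as below. This is a minimal illustration, assuming Welch's unequal-variance t-statistic and made-up expression values; the gene names and numbers are hypothetical and are not taken from the talk.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical expression values: gene -> (class A samples, class B samples).
expression = {
    "geneA": ([5.1, 5.3, 4.9, 5.2], [7.0, 7.2, 6.8, 7.1]),
    "geneB": ([3.0, 3.1, 2.9, 3.2], [3.1, 2.9, 3.0, 3.2]),
    "geneC": ([8.0, 8.4, 7.9, 8.1], [6.0, 6.3, 5.9, 6.2]),
}

# One statistic per gene, then rank genes by the size of the difference.
t_stats = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)

for gene in ranked:
    print(f"{gene}: t = {t_stats[gene]:.2f}")
```

The ranked list is what would then be looked up in an annotation database, one gene at a time, which is the step that pathway-level methods replace.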
Pathway Analysis • Next step in pathway analysis: Gene Set Enrichment Analysis (GSEA) & Analysis of Sample Set Enrichment Score (ASSESS) • Start with biological information: Gene sets • Score enrichment of gene sets in an expression profile with samples from two classes • GSEA outputs enrichment scores for each gene set in each phenotype • ASSESS outputs enrichment scores for each gene set in each individual
Enrichment Analysis • Given a ranked gene list and a gene set of interest, determine whether the genes in the set are “enriched” at the top or bottom of the list. • How could we conclude that G1 is enriched but G2 and G3 are not? [Figure: positions of gene sets G1, G2, and G3 along the ranked gene list]
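One way to make the G1 versus G2/G3 question concrete is a running-sum statistic. The sketch below assumes the unweighted, Kolmogorov-Smirnov-style score (each hit steps up by 1/#hits, each miss steps down by 1/#misses); GSEA as published also offers a weighted variant, and the gene names and set memberships here are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up on set members and down on
    non-members; return the largest deviation from zero (signed).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: g1 is the most up-regulated gene, g12 the least.
ranked = [f"g{i}" for i in range(1, 13)]
G1 = {"g1", "g2", "g3"}    # clustered at the top of the list
G2 = {"g2", "g6", "g11"}   # spread across the list
G3 = {"g3", "g7", "g10"}   # spread across the list

es1 = enrichment_score(ranked, G1)
es2 = enrichment_score(ranked, G2)
es3 = enrichment_score(ranked, G3)
print(es1, es2, es3)
```

In practice the score is compared against a null distribution (for example from permuted class labels) before a set is called enriched; that step is omitted here.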
Outline I. Biological Background • Problems with single gene analysis • Advantages of pathway analysis II. Gene Sets • How they are derived • Importance of understanding context III. Modeling Cancer Progression • Overview of multitask model • Prostate cancer example • Melanoma example
Gene Sets • Defined functionally or structurally • Defined by experimental methods or through literature. • Experimental: Knockouts, infections • Literature: Biochemical experiments, reported in databases such as BioCarta and GenMAPP
ASSESS of male vs. female in lymphoblastoid cells [Figure: enrichment scores, with axes labeled gene sets and samples]
Gene Set Accuracy • Analyses will depend on accuracy of gene sets. We ask: • What is the accuracy of gene sets annotated according to known perturbations? • How do gene sets defined by experimental studies vs. expert knowledge compare?
Hypoxia Gene Set • Hypoxia: The cellular response to low oxygen conditions. Includes new blood vessel formation • Seven hypoxia gene sets describing the cellular response to hypoxia
Hypoxia gene set accuracy • Expression data set with 6 hypoxic and 6 normal cells (Mense 2006) • GSEA applied with database of 508 gene sets.
RAS • 3 RAS gene sets: K-RAS, H-RAS, and the RAS pathway from BioCarta. • K-RAS and H-RAS are experimentally defined and context specific. • BioCarta's RAS gene set is the most general, consisting of genes thought to biochemically interact with RAS and proteins associated with RAS.
RAS gene set accuracy • Gene expression profile of 31 cells with tumors caused by K-RAS mutation and 19 normal cells. • H-RAS does not capture K-RAS specificity. • BioCarta's RAS gene set is appropriate to use regardless of the specific RAS mutation.
RAS gene set accuracy • Gene expression profile of 45 adenocarcinomas and 48 squamous lung cancer samples. • Data set indirectly involves RAS perturbations. • Enrichment scores from ASSESS were used to predict phenotype. Class prediction accuracy for the three sets: • 69.9% for the H-RAS pathway gene set • 75.3% for the K-RAS pathway gene set • 79.6% for the BioCarta RAS pathway gene set
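As a quick arithmetic check, the three reported percentages are consistent with 65, 70, and 74 correct predictions out of the 93 samples. These counts are inferred from the percentages under the assumption that accuracy was computed over all 45 + 48 samples; they are not stated in the source.

```python
# Total samples: 45 adenocarcinomas + 48 squamous lung cancer samples.
total = 45 + 48

# Inferred (not reported) numbers of correctly classified samples.
inferred_correct = {"H-RAS": 65, "K-RAS": 70, "BioCarta RAS": 74}

accuracy = {
    name: round(100 * correct / total, 1)
    for name, correct in inferred_correct.items()
}
print(accuracy)
```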
Outline I. Biological Background • Problems with single gene analysis • Advantages of pathway analysis II. Gene Sets • How they are derived • Importance of understanding context III. Modeling Cancer Progression • Overview of multitask model • Prostate cancer example • Melanoma example
Dynamics of Cancer Progression • Long lists of genes implicated in various stages of cancer exist for many different cancer types. We want to learn how these genes interact via signaling pathways and functional relationships. • The next step is a mechanistic understanding of cancer progression at the pathway level. • There are only a few types of cancer where we know which pathways acquire mutations that initiate tumorigenesis. • Eye: RB1 • Are other types of cancer initiated by one or several pathways becoming altered? • The alteration of one gene hardly ever suffices to give rise to full-blown cancer. • Oncogenes, tumor suppressor genes (TSGs), and stability genes drive tumor progression. • Mammalian cells have multiple safeguards. Several genes must be defective for invasive cancer to develop.
Objectives • Identify pathways most relevant throughout progression and pathways most relevant to individual transitions. • Build pathway networks: Estimate the interdependence of pathways relevant to each step of tumor progression. • Refine relevant pathways and infer a gene network for those relevant gene sets.
Hierarchical Modeling • Tumor progression • FIXED EFFECTS: Stage in cancer progression. Individuals will show similar pathway deregulation as cancer progresses depending on whether they have benign, primary or metastatic lesions. • RANDOM EFFECTS: Within a stage, individuals will have differences based on how they specifically developed the disease.
Regularized Multitask Learning (RML) • Current analyses of genomic data evaluate each stage in progression independently, missing relationships between the data. • Integration of the data over all stages will provide a more complete picture of the processes underlying tumorigenesis. • RML learns a problem together with other related problems at the same time. Learning the problems in parallel can help each problem be better learned by using a shared representation. • Problems: Which pathways are relevant to transition 1? Transition 2? Which pathways are relevant throughout progression?
Stratifying Data • States: normal (n), early (e), metastatic (m). • Data: Gene expression for g genes in s samples. Stratify data into T datasets, one for each step in progression. • T=2: D1 contains the n and e samples; D2 contains the e and m samples.
Modeling tumor progression • Model Summary: Find relevant pathways in the overall progression {n→e→m} And the relevant pathways at different stages {n→e} and {e→m} The task t corresponds to progression from less serious to more serious states t=1: {n→e}, t=2: {e→m}
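The stratification into tasks can be sketched as follows. This is a minimal illustration with hypothetical sample IDs; it assumes the -1 (less serious) / +1 (more serious) label coding and shows that early-stage samples appear in both tasks with opposite labels.

```python
# Each task pairs a less serious state with a more serious one.
TASKS = {1: ("n", "e"), 2: ("e", "m")}


def stratify(samples):
    """Split (sample_id, state) pairs into one labeled dataset per task.

    Within a task, the less serious state is labeled -1 and the more
    serious state +1; samples from other states are left out.
    """
    datasets = {}
    for task, (less, more) in TASKS.items():
        datasets[task] = [
            (sample_id, -1 if state == less else 1)
            for sample_id, state in samples
            if state in (less, more)
        ]
    return datasets


# Hypothetical samples: two normal, two early, two metastatic.
samples = [("s1", "n"), ("s2", "n"), ("s3", "e"),
           ("s4", "e"), ("s5", "m"), ("s6", "m")]
datasets = stratify(samples)
print(datasets[1])
print(datasets[2])
```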
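Transformation • Gene expression data is transformed using ASSESS. • D: genes x samples → S: gene sets x samples • D1, D2 → S1, S2 [Figure: D1 (n, e samples) and D2 (e, m samples), genes x samples, are mapped to S1 and S2, gene sets x samples; the figure labels the gene dimension as 20,000]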
Multitask SVM • Support vector machines (SVMs) - regularization method • Input regression data • Estimate a regression function f - a summary statistic of Y|X. • Multitask SVM • builds classification models jointly over all data sets, Y|S1, S2. • Provides a baseline model for gene sets relevant to predicting phenotype in both data sets, Y|S1,S2 • Provides gene sets relevant to only one data set, Y|S1 and Y|S2 • These regressions provide data set dependent corrections to the baseline model.
The Model • Input: x = S1, S2 • Class labels: y = {-1, 1}, where -1 = less serious, 1 = more serious. • Build two regression models ft1(x) and ft2(x), for transition 1 data and transition 2 data. • Each model combines b(x), a baseline term over all tasks, and rt(x), a task-specific correction. • The discriminant functions are built from: • w0, the vector of baseline weights for the gene sets • vt1, the vector of correction terms for transition 1 • vt2, the vector of correction terms for transition 2 • b, a scalar offset
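The equations for the discriminant functions did not survive in this transcript. A form consistent with the terms listed on the slide, assuming the linear model used in regularized multitask learning, would be:

```latex
f_{t}(x) \;=\; \underbrace{w_0^{\top} x}_{\text{baseline term}}
\;+\; \underbrace{v_{t}^{\top} x}_{\text{task-specific correction}}
\;+\; b,
\qquad t \in \{t_1, t_2\}
```

Whether the scalar offset b is counted as part of the baseline term is an assumption here; the slide only lists it as a separate parameter.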
The Model • Parameters are estimated by solving a minimization problem in which v(f(xit), yit) is a loss function. • If tasks are thought to be highly related, set the ratio λ2/λ1 to be large.
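The objective itself is missing from this transcript. A plausible reconstruction, assuming the standard regularized multitask learning objective, is given below. The loss is written as V to keep it distinct from the correction vectors, n_t is the number of samples in task t, and the placement of λ1 and λ2 is chosen so that a large λ2/λ1 shrinks the task-specific corrections toward zero, which matches the slide's advice for highly related tasks.

```latex
\min_{w_0,\; v_1, \dots, v_T,\; b} \;\;
\sum_{t=1}^{T} \sum_{i=1}^{n_t} V\bigl(f_t(x_{it}),\, y_{it}\bigr)
\;+\; \lambda_1 \lVert w_0 \rVert^{2}
\;+\; \lambda_2 \sum_{t=1}^{T} \lVert v_t \rVert^{2}
```

Under this reading, letting λ2/λ1 grow forces all tasks toward the shared baseline model, while a small ratio lets each transition be fit almost independently.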
Model Interpretation • wj0: weight of the jth gene set in the baseline model. Gene sets for which |wj0| is largest are relevant in {n→e→m}. • vjt: weight of the jth gene set in state progression t. Gene sets for which |vj1| is large are relevant in {n→e}, and gene sets for which |vj2| is large are relevant in {e→m}.
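Reading off relevant gene sets from fitted weights amounts to sorting by absolute value. The sketch below uses hypothetical gene set names and weight values purely to illustrate the ranking step; it does not fit the model.

```python
def top_gene_sets(weights, names, k=3):
    """Return the k gene set names with the largest absolute weight."""
    order = sorted(range(len(weights)),
                   key=lambda j: abs(weights[j]), reverse=True)
    return [names[j] for j in order[:k]]


# Hypothetical gene sets and fitted weights.
names = ["set_A", "set_B", "set_C", "set_D"]
w0 = [0.90, -0.10, 0.05, -0.70]   # baseline: relevant throughout {n→e→m}
v1 = [0.00, 0.80, -0.10, 0.10]    # correction for transition 1: {n→e}
v2 = [0.10, 0.00, -0.60, 0.05]    # correction for transition 2: {e→m}

throughout = top_gene_sets(w0, names, k=2)
early_only = top_gene_sets(v1, names, k=1)
late_only = top_gene_sets(v2, names, k=1)
print(throughout, early_only, late_only)
```

The sign of each weight is kept in the model and indicates the direction of association with the more serious state; only the ranking uses absolute values.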
Prostate Cancer • Gene expression profile of 22 benign epithelium samples (b), 32 primary prostate cancer samples (p), and 17 metastatic prostate cancer samples (m) (Tomlins 2007). • Progression {b→p→m} [Figure panels: w0, v1, v2]
Results • Categorized results by “Hallmarks of Cancer” (Hanahan 2000): • Self-sufficiency in growth signals • Insensitivity to anti-growth signals • Evasion of apoptosis • Limitless replicative potential • Angiogenesis • Invasion and metastasis
Results • Self sufficiency in growth signals • Cell cycle gene sets • ErbB4, EGF, Sprouty, ERK
Results • Evidence for insensitivity to anti-growth signals: • PTEN down-regulation • PTDINS up-regulation • Evasion of apoptosis: • IGF1R up-regulation • ROS down-regulation • Energy production • Glycolysis gene set up-regulation • ATP synthesis gene set up-regulation • Oxidative phosphorylation up-regulation
Novel Findings • Took previous analysis a step further by discovering the specific pathways implicated in tumorigenesis. • Previous work identified single genes that were relevant in progression and grouped them together to form important concepts. • Little is currently known about ErbB4 deregulation in prostate cancer (PCA). • EGF receptors have been implicated in several tumor types: stomach, brain, breast. • ErbB2/HER2 has been shown to be overexpressed in prostate cancer. Tomlins 2007
Objective 2: Pathway dependency structure • Infer a pathway interaction network for each stage of progression using learning gradients and inverse regression. • Provide knowledge on how certain pathways relate, interact, and influence one another with respect to phenotype.
Objective 2 • Standard regression methods show which gene sets are correlated with class labels but do not provide information on the co-variation of gene sets correlated with class labels. • Estimate covariance of inverse regression C=cov(X|Y) • Input matrix of enrichment scores (X) and class labels (Y) • Output covariance matrix C=cov(X|Y) • Diagonal elements measure relevance of i-th gene set with respect to change in label. • ij-th off diagonal element measures the dependence between gene sets i and j. • Relationships will be visualized in graphical models.
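A small sketch of the covariance estimate follows. The slide's notation C=cov(X|Y) does not pin down the estimator, so this assumes one common reading from sliced inverse regression: the covariance of the class-conditional means E[X|Y], weighted by class frequency. The enrichment scores are hypothetical.

```python
def inverse_regression_cov(X, y):
    """Covariance of class-conditional means, cov(E[X | Y]).

    X: list of samples, each a list of gene set enrichment scores.
    y: list of class labels, one per sample.
    Returns a (gene sets x gene sets) matrix as nested lists.
    """
    n, p = len(X), len(X[0])
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    C = [[0.0] * p for _ in range(p)]
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        weight = len(rows) / n
        # Shift of this class's mean away from the overall mean.
        diff = [sum(r[j] for r in rows) / len(rows) - overall[j]
                for j in range(p)]
        for i in range(p):
            for j in range(p):
                C[i][j] += weight * diff[i] * diff[j]
    return C


# Hypothetical scores: 4 samples x 3 gene sets; set 3 does not change with class.
X = [[1.0, 2.0, 5.0],
     [1.2, 2.2, 5.1],
     [3.0, 0.0, 5.0],
     [3.2, 0.2, 5.1]]
y = [-1, -1, 1, 1]
C = inverse_regression_cov(X, y)
for row in C:
    print([round(value, 3) for value in row])
```

Here the first two gene sets have large diagonal entries (both change with the label) and a negative off-diagonal entry (they move in opposite directions), while the third set contributes nothing.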
Objective 2 • Analysis can identify pathways that are closely associated throughout progression: • IGF1R and ERK are linked through their association with RAS. ERK ranks 9th out of 522 gene sets based on the covariance with the IGF1R pathway. • PTDINS ranks 15th based on the covariance with the PTEN gene set • IGF1R ranks 32nd based on the covariance with PTDINS
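Rankings such as "9th out of 522" come from sorting one row of the covariance matrix. A minimal sketch, assuming ranking by absolute covariance and using a hypothetical four-pathway matrix with placeholder names:

```python
def rank_by_covariance(C, names, target):
    """Order all other gene sets by |covariance| with the target gene set."""
    t = names.index(target)
    others = [j for j in range(len(names)) if j != t]
    others.sort(key=lambda j: abs(C[t][j]), reverse=True)
    return [names[j] for j in others]


# Hypothetical symmetric covariance matrix over four placeholder pathways.
names = ["pathway_A", "pathway_B", "pathway_C", "pathway_D"]
C = [
    [1.00, 0.20, -0.75, 0.05],
    [0.20, 1.00, 0.10, 0.40],
    [-0.75, 0.10, 1.00, 0.00],
    [0.05, 0.40, 0.00, 1.00],
]

ranking = rank_by_covariance(C, names, "pathway_A")
# 1-based rank of pathway_B with respect to pathway_A.
rank_of_b = ranking.index("pathway_B") + 1
print(ranking, rank_of_b)
```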
Objective 2 • A: Dependency structure of the 10 gene sets most relevant in the benign to prostate cancer transition • B: Extended dependency structure
Objective 3: Refinement • Gene sets available are not always in the right context for a specific data set. • The refinement procedure adapts the gene set to the context of the data set. Shows which genes are dependent on each other and if there is substructure in the gene set. • Cluster genes in gene set based on their covariance: C=cov(X|Y); • X= gene expression value of genes in the gene set • Y= class labels • A gene network modeling the interdependence of the genes in the refined gene set is inferred.
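The slide does not say which clustering method is used. As one simple stand-in, the sketch below links two genes when their absolute covariance reaches a threshold and reports the connected components; the gene names and covariance values are hypothetical. It also shows how different thresholds yield different refinements.

```python
def refine_gene_set(C, genes, threshold):
    """Group genes into clusters linked by |covariance| >= threshold."""
    n = len(genes)
    neighbors = {
        i: [j for j in range(n) if j != i and abs(C[i][j]) >= threshold]
        for i in range(n)
    }
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            component.append(genes[i])
            for j in neighbors[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters


# Hypothetical covariance among four genes of one gene set.
genes = ["g1", "g2", "g3", "g4"]
C = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.5, 0.1],
    [0.1, 0.5, 1.0, 0.1],
    [0.0, 0.1, 0.1, 1.0],
]

strict = refine_gene_set(C, genes, threshold=0.6)
loose = refine_gene_set(C, genes, threshold=0.4)
print(strict)
print(loose)
```

Genes left as singletons at a given threshold are candidates for removal from the refined set, and the surviving links form the inferred gene network.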
Gene Set Refinement • The genes of BioCarta's ERK pathway • Refine the pathway to those genes most relevant for this data set. • A and B differ in threshold values
Melanoma Progression • Gene expression profile of 4 normal skin samples (n), 4 primary melanoma samples (p), and 4 metastatic melanoma samples (m) (Smith 2005). • Progression {n→p→m} [Figure panels: w0, v1, v2]
Melanoma Results • Self-sufficiency in growth signals • AKT up-regulation throughout progression • PTDINS up-regulation throughout progression • Escape from apoptosis • IGF1R up-regulation in the late transition • p53 down-regulation throughout progression • Limitless replicative potential • HTERT up-regulation in the early transition • Angiogenesis • HIF up-regulation throughout progression • Angiogenesis gene set up-regulation in the early transition • Invasion and Metastasis • CDC42RAC up-regulation throughout progression • MTA3 down-regulation in the early transition
Validation • Gene expression profile of 9 samples of benign nevi, 6 samples of primary melanoma, and 19 samples of metastatic melanoma (Haqq 2005) • Both analyses found: • p53 gene set down-regulation • D4-GDI pathway over-expression • HTERT gene set over-expression • CDC42RAC pathway over-expression [Figure panels: w0, v1, v2]
Pathway Dependencies • A: Dependency structure of top 10 gene sets most relevant in the normal skin to primary melanoma transition • B: Extended dependency structure
Sterol Biosynthesis • The sterol biosynthesis gene set is highly connected. • Tumor cells often have sterol synthesis deficiencies. • One component of the sterol biosynthesis pathway is the mevalonate pathway. • Many tumor cells cannot synthesize mevalonate, so they obtain it from the host.
Pathway Dependencies • Rank of interdependence with the sterol biosynthesis gene set, out of 523 gene sets: • Fatty acid synthesis ranks 14th • Cyanoamino acid metabolism ranks 19th • Gamma hexachlorocyclohexane ranks 3rd • All are closely tied to the inability of a tumor to synthesize certain metabolites and its increasing need for these metabolites as it grows and develops.