Part 5: Linking Microarray Data with Survival Analysis

Use of microarray data via model-based classification in the study and prediction of survival from lung cancer (Ben-Tovim Jones et al., 2005)

Problems
• Censored observations: the time of occurrence of the event (death) has not yet been observed.
• Small sample sizes: the study is limited by patient numbers.
• Specific patient group: is the study applicable to other populations?
• Difficulty in integrating different studies (different microarray platforms).

A Case Study: The Lung Cancer Data Sets from CAMDA’03
• Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario).
• The challenge: to integrate information from different data sets (two Affymetrix chips of different versions and two cDNA arrays).
• The final goal: to make an impact on cancer biology and eventually patient care.
“Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis (Park et al. Bioinformatics: S120, 2002).”

Methodology of Survival Analysis using Microarrays
• Cluster the tissue samples (e.g., using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al., 2000).
• Park et al. (2002) and Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox.
Unsupervised vs. Supervised Methods
• Semi-supervised approach of Bair and Tibshirani (2004), which combines gene expression data with the clinical data.
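The Kaplan-Meier step mentioned above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function name kaplan_meier and the toy follow-up times are hypothetical and are not taken from any CAMDA’03 data set. An event flag of 1 means the event was observed; 0 means the observation is censored.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  : follow-up time for each patient
    events : 1 if the event (death/recurrence) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    curve = []
    surv = 1.0
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - deaths[t] / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy data (months); NOT from the CAMDA'03 data sets.
times = [5, 8, 8, 12, 16, 20, 24, 30]
events = [1, 1, 0, 1, 0, 1, 0, 0]

for t, s in kaplan_meier(times, events):
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients never trigger a drop in the curve, but they still count in the risk set up to their censoring time, which is how the estimator uses partial follow-up.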

AIM: To link gene-expression data with survival from lung cancer in the CAMDA’03 challenge
A. CLUSTER ANALYSIS: We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.
B. SURVIVAL ANALYSIS: The association between the clusters so formed and patient survival (recurrence) times is established.
C. DISCRIMINANT ANALYSIS: We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.

Lung Cancer
• Approximately 80% of lung cancer patients have non-small-cell lung cancer (NSCLC), of which adenocarcinoma is the most common form.
• All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases).
• Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery.
• A prognostic test for early-stage lung adenocarcinoma is wanted, to identify patients who are more likely to recur and who would therefore benefit from adjuvant therapy.

Lung Cancer Data Sets (see http://www.camda.duke.edu/camda03) Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001), Beer et al. (2002).

Heat Map for 2880 Ontario Genes (39 Tissues) [Figure; axes: Genes, Tissues]

Heat Maps for the 20 Ontario Gene-Groups (39 Tissues) [Figure; axes: Genes, Tissues]
Tissues are ordered as: Recurrence (1-24) and Censored (25-39).

Expression Profiles for Useful Metagenes (Ontario 39 Tissues) [Figure; panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20; x-axis: Tissues, ordered as Recurrence (1-24) and Censored (25-39); y-axis: Log Expression Value; legend: Our Tissue Cluster 1, Our Tissue Cluster 2]

Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1 (31 tissues): 23 (recurrence) plus 8 (censored). Poor prognosis.
CLUSTER 2 (8 tissues): 1 (recurrence) plus 7 (censored). Good prognosis.

SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
$S(t) = \Pr\{T > t\} = p_1 S_1(t) + p_2$,
where $T$ is the time to recurrence and $p_1 = 1 - p_2$ is the prior probability of recurrence. Adopt a Weibull model for the survival function for recurrence, $S_1(t)$.
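A long-term survivor model of this kind is usually written as a two-component mixture, S(t) = p1 S1(t) + p2, in which a proportion p2 of patients never recur. The sketch below assumes that form and assumes the Weibull parameterisation S1(t) = exp(-(t/scale)^shape); the function names, the parameterisation and all numerical values are hypothetical and serve only to show how censored and uncensored observations enter the likelihood.

```python
import math


def weibull_survival(t, scale, shape):
    """Weibull survival function S1(t) = exp(-(t/scale)**shape) (assumed form)."""
    return math.exp(-((t / scale) ** shape))


def lts_survival(t, p1, scale, shape):
    """Overall survival: recurrence component plus long-term survivors."""
    return p1 * weibull_survival(t, scale, shape) + (1.0 - p1)


def lts_log_likelihood(times, events, p1, scale, shape):
    """Log-likelihood of the LTS model for right-censored data (all times > 0)."""
    loglik = 0.0
    for t, e in zip(times, events):
        s1 = weibull_survival(t, scale, shape)
        if e == 1:
            # Observed recurrence: contributes the density p1 * f1(t),
            # where f1(t) = hazard(t) * S1(t).
            hazard = (shape / scale) * (t / scale) ** (shape - 1)
            loglik += math.log(p1 * hazard * s1)
        else:
            # Censored: either not yet recurred, or a long-term survivor.
            loglik += math.log(p1 * s1 + (1.0 - p1))
    return loglik


# Hypothetical parameter values and toy data, for illustration only.
p1, scale, shape = 0.6, 20.0, 1.5
times = [5, 8, 12, 30, 40, 45]
events = [1, 1, 1, 0, 0, 0]

print("S(10) =", round(lts_survival(10, p1, scale, shape), 3))
print("log-likelihood =", round(lts_log_likelihood(times, events, p1, scale, shape), 3))
```

The fitted curve levels off at p2 = 1 - p1 rather than falling to zero, which is the feature that distinguishes it from an ordinary Weibull fit when many observations are censored late in follow-up. In practice the three parameters would be estimated by maximising this log-likelihood.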

Fitted LTS Model vs. Kaplan-Meier

PCA of Tissues Based on Metagenes [Figure; axes: First PC, Second PC]

PCA of Tissues Based on All Genes (via SVD) [Figure; axes: First PC, Second PC]

Cluster-Specific Kaplan-Meier Plots
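Cluster-specific Kaplan-Meier curves are commonly compared with a two-group log-rank test. The slides do not state which nonparametric test was used, so the following is only a sketch of that common choice; the function name logrank_test and the toy data are hypothetical.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test. groups holds 0 or 1 for each patient.

    Returns (chi-square statistic on 1 degree of freedom, p-value).
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed1 = expected1 = variance = 0.0
    for t in event_times:
        n = sum(1 for u in times if u >= t)                      # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e == 1 and g == 1)
        observed1 += d1
        expected1 += d * n1 / n                                  # expected under H0
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed1 - expected1) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy data: group 1 recurs early, group 0 is mostly censored.
times = [3, 5, 7, 9, 11, 10, 14, 18, 22, 26]
events = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

chi2, p = logrank_test(times, events, groups)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

At each event time the test compares the number of events observed in one cluster with the number expected if both clusters shared the same hazard, so censored patients contribute only while they remain at risk.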

Survival Analysis for Ontario Dataset
• Nonparametric analysis: A significant difference between Kaplan-Meier estimates for the two clusters (P=0.027).
• Cox’s proportional hazards analysis:

Discriminant Analysis (Supervised Classification)
• A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.
• A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes and using the metagenes.
• A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.
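One common way to correct a cross-validated error rate for selection bias is to repeat the gene selection inside every training fold, so that the held-out tissues never influence which genes are chosen. The sketch below illustrates that idea only. To stay short and dependency-free it swaps in a nearest-centroid classifier and a simple mean-difference gene ranking in place of the SVM with recursive feature elimination used in the study; all function names and the simulated data are hypothetical.

```python
import random


def select_genes(X, y, k):
    """Rank genes by absolute difference in class means; return top-k indices."""
    n_genes = len(X[0])
    scores = []
    for j in range(n_genes):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(n_genes), key=lambda j: -scores[j])[:k]


def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def external_cv_error(X, y, k_genes, n_folds=10, seed=0):
    """Cross-validated error with gene selection redone inside each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection sees the training fold only: this is the bias correction.
        genes = select_genes(X_tr, y_tr, k_genes)
        for i in test_idx:
            pred = nearest_centroid_predict(X_tr, y_tr, X[i], genes)
            errors += int(pred != y[i])
    return errors / len(X)


# Simulated data (NOT real expression values): 40 tissues, 50 genes,
# of which only genes 0 and 1 differ between the two classes.
rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(0, 1) + (4.0 if (lab == 1 and j < 2) else 0.0) for j in range(50)]
     for lab in y]

print("external 10-fold CV error:", external_cv_error(X, y, k_genes=5))
```

Selecting genes once on all tissues and then cross-validating only the classifier would reuse the test tissues during selection and tends to give an optimistically low error rate, particularly when genes greatly outnumber tissues.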

ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure; x-axis: log2 (number of genes), 0 to 12; y-axis: Error Rate (CV10E), 0 to 0.12]
Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).

STANFORD DATA 918 genes based on 73 tissue samples from 67 patients. Row and column normalized, retained 451 genes after select-genes step. Used 20 metagenes to cluster tissues. Retrieved histological groups.

Heat Maps for the 20 Stanford Gene-Groups (73 Tissues) [Figure; axes: Genes, Tissues]
Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73).

STANFORD CLASSIFICATION:
Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)

Heat Maps for the 15 Stanford Gene-Groups (35 Tissues) [Figure; axes: Genes, Tissues]
Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35).

Expression Profiles for Top Metagenes (Stanford 35 AC Tissues) [Figure; panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4; x-axis: Tissues; y-axis: Log Expression Value; legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, Misallocated]

Cluster-Specific Kaplan-Meier Plots

Survival Analysis for Stanford Dataset
• Kaplan-Meier estimation: A significant difference in survival between clusters (P<0.001).
• Cox’s proportional hazards analysis:

Survival Analysis for Stanford Dataset • Univariate Cox’s proportional hazards analysis (metagenes):

Survival Analysis for Stanford Dataset • Multivariate Cox’s proportional hazards analysis (metagenes): The final model consists of four metagenes.
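For a single covariate such as one metagene, a Cox proportional hazards fit amounts to maximising the partial likelihood over one coefficient. The sketch below does this by Newton-Raphson with the Breslow treatment of ties. It is an illustration under stated assumptions rather than the analysis behind the slides: the function name cox_univariate and the toy data are hypothetical, and a multivariate fit would replace the scalar coefficient with a vector.

```python
import math


def cox_univariate(times, events, x, n_iter=25):
    """One-covariate Cox model fitted by Newton-Raphson on the partial likelihood.

    Returns (beta, standard_error). Ties use the Breslow approximation.
    """
    beta = 0.0
    info = 0.0
    for _ in range(n_iter):
        score = 0.0
        info = 0.0
        for ti, ei, xi in zip(times, events, x):
            if ei != 1:
                continue  # censored patients contribute only through risk sets
            risk = [xj for tj, xj in zip(times, x) if tj >= ti]
            w = [math.exp(beta * xj) for xj in risk]
            s0 = sum(w)
            s1 = sum(wj * xj for wj, xj in zip(w, risk))
            s2 = sum(wj * xj * xj for wj, xj in zip(w, risk))
            mean = s1 / s0
            score += xi - mean              # first derivative of log partial likelihood
            info += s2 / s0 - mean ** 2     # minus the second derivative
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta, 1.0 / math.sqrt(info)


# Hypothetical toy data: x stands for one metagene's expression per patient.
times = [2, 4, 6, 8, 10, 12, 14, 16]
events = [1, 1, 1, 0, 1, 1, 0, 1]
x = [1.2, 0.3, 0.9, 0.1, 0.7, -0.2, 0.5, -0.4]

beta, se = cox_univariate(times, events, x)
print(f"beta = {beta:.3f}, SE = {se:.3f}, hazard ratio = {math.exp(beta):.3f}")
```

A positive coefficient means higher expression of the covariate is associated with a higher hazard; exp(beta) is the hazard ratio per unit increase. The plain Newton-Raphson update has no step-halving, so it can fail when the covariate perfectly orders the event times.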

STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure; x-axis: log2 (number of genes), 0 to 10; y-axis: Error Rate (CV10E), 0 to 0.07]
Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters.

CONCLUSIONS
• We applied a model-based clustering approach to classify tumors using their gene signatures into:
(a) clusters corresponding to tumor type
(b) clusters corresponding to clinical outcomes for tumors of a given subtype
• In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).

CONCLUSIONS (cont.) The clusters in (b) were identified with clinical outcomes (e.g. recurrence/recurrence-free and death/long-term survival). We were able to show that gene-expression data provide prognostic information, beyond that of clinical indicators such as stage.

CONCLUSIONS (cont.) Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) further demonstrated the potential of gene expression as a tool for guiding treatment and patient care for lung cancer patients. This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes, in addition to those provided by the cluster-genes step in the initial unsupervised classification.

LIMITATIONS
• Small number of tumors available (e.g., Ontario and Stanford datasets).
• Clinical data available for only subsets of the tumors, often for only one tumor type (AC).
• High proportion of censored observations limits comparison of survival rates.
