Improved Tumor Marker Validation Success Using Weighted Gene Co-expression Networks and Random Forest Clustering

1. Improved Tumor Marker Validation Success Using Weighted Gene Co-expression Networks and Random Forest Clustering Steve Horvath shorvath@mednet.ucla.edu Human Genetics & Biostatistics University of California, Los Angeles

2. Contents Describe a pathway based tumor marker screening strategy. Speculate on the biological reasons why it could work. Describe 2 empirical success stories for identifying tumor markers that validated in independent data sets. Brain cancer: survival time; (Affy) gene expression microarray data; weighted gene co-expression networks. Prostate cancer: time to PSA recurrence; tissue microarray data (immunohistochemical stainings); random forest clustering.

3. The Embarrassing Validation Problem A tumor marker is found to be highly predictive of a clinical outcome in one data set but fails to be validated in an independent data set. "Bad" (analysis) reasons include data snooping, overfitting, and ascertainment issues. "Good" (biological) reasons: Genetic heterogeneity: little can be done about this. Single markers don't capture the essence of the whole disease pathway: a lot can be done about this with novel statistical methods for extracting signal from the data.

4. Outline of standard strategy for screening for markers 1) Regress a clinical outcome y on the molecular markers (features) X. 2) Identify the features that are most significant or most predictive of the outcome using standard statistical feature selection methods. Empirical finding: often poor validation success.

5. "Pathway Based" Strategy for Screening for Markers Find suitably defined clusters in the underlying high dimensional feature space X. Relate the clusters to clinical outcomes of interest. This results in a few "disease clusters" (a.k.a. pathways or modules). Use features (markers) that describe the states of the disease clusters as final predictors. (Limited) empirical finding: improved validation success.

6. Motivating why the pathway based screening strategy may lead to better validation success: By first clustering the features, one reduces the number of multiple comparisons substantially. By looking at aggregates of features (clusters), the feature definition is much more robust and more likely to be platform independent. Combining the features along pathways is the biologically meaningful thing to do. Pathways are closer to the clinical phenotype than the individual constituents of these pathways. The whole is more than the sum of its parts...

7. Teaser: Validation success rate of gene expressions in independent data

8. Weighted Gene Co-Expression Network Analysis.

9. Novel statistical approach for analyzing microarray data: weighted network analysis. Empirical evidence that it matters in practice: identification of brain cancer genes that can be validated in an independent data set.

10. Background Network based methods have been found useful in many domains: protein interaction networks, the world wide web, social interaction networks. Our focus: gene co-expression networks.

12. Scale free topology is a fundamental property of such networks (Barabasi et al). It entails the presence of hub nodes that are connected to a large number of other nodes. Such networks are robust with respect to the random deletion of nodes but are sensitive to targeted attacks on hub nodes. It has been demonstrated that metabolic networks exhibit a scale free topology.

13. P(k) vs k in scale free networks

14. How to check Scale Free Topology?

15. Gene Co-expression Networks In gene co-expression networks, each gene corresponds to a node. Two genes are connected by an edge if their expression values are highly correlated. The definition of "high" correlation is somewhat tricky; we propose a criterion for picking the threshold parameter.

16. Steps for constructing a simple, unweighted co-expression network

17. Our "holistic" view...

18. A general framework for defining weighted gene co-expression networks. Bin Zhang, Steve Horvath. Technical report and R code at www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

19. Beyond the standard approach Dichotomization allows one to easily define network-based concepts, but it eliminates some information regarding the strength of interaction. To overcome this disadvantage of dichotomization, we generalize the approach. Measure co-expression by a similarity s(i,j) with range [0,1], e.g. the absolute value of the Pearson correlation. Define an adjacency matrix A(i,j)=AF(s(i,j)). The adjacency function AF is a monotonic, non-negative function defined on [0,1] and depends on parameters. The choice of the parameters determines the properties of the network. We consider 2 types of AFs: the step function AF(s)=I(s>tau) with parameter tau, and the power function AF(s)=s^b with parameter b.
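
The two adjacency functions on this slide can be sketched in a few lines. This is a minimal illustration and not the authors' R code: the function names (`pearson`, `step_adjacency`, `power_adjacency`, `adjacency_matrix`), the toy expression values, and the choice to set the diagonal to 0 are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def step_adjacency(s, tau):
    """Hard thresholding: AF(s) = I(s > tau)."""
    return 1.0 if s > tau else 0.0

def power_adjacency(s, b):
    """Soft thresholding: AF(s) = s^b."""
    return s ** b

def adjacency_matrix(expr, af):
    """Build A(i,j) = AF(s(i,j)) with s(i,j) = |Pearson correlation|.

    expr: list of expression profiles, one list of values per gene.
    af:   adjacency function mapping a similarity in [0,1] to [0,1].
    The diagonal is left at 0 (an assumption of this sketch).
    """
    n = len(expr)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = abs(pearson(expr[i], expr[j]))
            A[i][j] = A[j][i] = af(s)
    return A

# Toy data (hypothetical): 4 genes measured in 4 samples.
expr = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
unweighted = adjacency_matrix(expr, lambda s: step_adjacency(s, tau=0.9))
weighted = adjacency_matrix(expr, lambda s: power_adjacency(s, b=2))
```

With the step function, a pair with similarity 0.8 is dropped entirely at tau=0.9, while the power function keeps it with a reduced weight; this is the information that dichotomization discards.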

20. Comparing adjacency functions:

21. How to estimate the parameter values of an adjacency function? We propose to use the following criteria: A) CONSIDER ONLY THOSE PARAMETER VALUES THAT RESULT IN APPROXIMATE SCALE FREE TOPOLOGY. B) SELECT THE PARAMETERS THAT RESULT IN THE HIGHEST MEAN NUMBER OF CONNECTIONS. Criterion A is motivated by the finding that most metabolic networks (including gene co-expression networks, protein-protein interaction networks and cellular networks) have been found to exhibit a scale free topology. Criterion B is motivated by our desire to have high sensitivity to detect modules (clusters of genes) and hub genes.

22. Criterion A is measured by the linear model fitting index R^2

23. Trade-off between criterion A (R^2) and criterion B (mean no. of connections) when varying the power b

24. Empirical insights for determining the adjacency function For criterion A: measure compliance with scale free topology by using the adjusted R^2 value for the linear regression fit between log(p(k)) and log(k). Usually require R^2>0.8. For criterion B: aim to get a mean(k)=50 when dealing with 2000 genes.
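
Criterion A can be illustrated with a short sketch that bins the connectivities, estimates p(k) per bin, and fits a straight line to log10(p(k)) versus log10(k). The function name `scale_free_fit`, the equal-width binning, and the default of 10 bins are assumptions of this example; the slides do not specify how p(k) is estimated.

```python
import math

def scale_free_fit(connectivity, n_bins=10):
    """Return (slope, R^2, adjusted R^2) of log10 p(k) regressed on log10 k.

    connectivity: list of per-gene connectivities k.
    Bins are equal-width; empty bins are skipped (assumption of this sketch).
    """
    k_min, k_max = min(connectivity), max(connectivity)
    if k_max == k_min:
        raise ValueError("connectivity values must not all be equal")
    width = (k_max - k_min) / n_bins
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for k in connectivity:
        b = min(int((k - k_min) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += k
    xs, ys = [], []
    for c, s in zip(counts, sums):
        if c > 0 and s > 0:
            xs.append(math.log10(s / c))                   # log10 of mean k in bin
            ys.append(math.log10(c / len(connectivity)))   # log10 of p(k)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 non-empty bins")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, r2, adj_r2

# Hypothetical connectivities following p(k) proportional to k^-2.
k_values = [1] * 64 + [2] * 16 + [4] * 4 + [8]
slope, r2, adj_r2 = scale_free_fit(k_values)
keep_parameter = adj_r2 > 0.8   # the rule of thumb quoted on the slide
```

One way to use this is to compute the fit for each candidate tau or b, keep the values that pass the R^2 threshold, and among those pick the one with the highest mean connectivity (criterion B).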

25. Trade-off between criterion A and B when varying tau

26. Mathematical Definition of an Undirected Network

27. Network=Adjacency Matrix A network can be represented by an adjacency matrix, A=[a_ij], that encodes whether/how a pair of nodes is connected. A is a symmetric matrix with entries in [0,1]. For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected). For weighted networks, the adjacency matrix reports the connection strength between gene pairs.

28. Generalized Connectivity Gene connectivity corresponds to the row sums of the adjacency matrix. For unweighted networks: number of direct neighbors. For weighted networks: sum of connection strengths to other nodes.
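
As a small worked example of the row-sum definition, the sketch below computes the connectivity of each node and the mean connectivity used in criterion B. The function names and the toy matrix are hypothetical, and excluding the diagonal entry from each row sum is an assumption of this sketch.

```python
def connectivity(A):
    """Row sums of the adjacency matrix, excluding the diagonal entry."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def mean_connectivity(A):
    """Average connectivity over all nodes (criterion B compares this value)."""
    k = connectivity(A)
    return sum(k) / len(k)

# Hypothetical weighted network on 3 genes.
A = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.25],
    [0.5, 0.25, 0.0],
]
k = connectivity(A)
```

For a 0/1 matrix the same function returns the number of direct neighbors, so the unweighted case is covered without a separate code path.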

29. Network Analysis Flow Chart

31. Network Distance Measure: Topological Overlap Matrix

32. How to measure distance in a network? Mathematical answer: geodesics, i.e. the length of the shortest path connecting 2 nodes. We have found no empirical evidence that this is a biologically meaningful concept in co-expression networks. Biological answer: look at shared neighbors. Intuition: if 2 people share the same friends, they are close in a social network. Use the topological overlap measure based distance proposed by Ravasz et al (2002, Science).

33. Topological Overlap (Ravasz et al) leads to a network distance measure. Generalized in Zhang and Horvath (2005) to the case of weighted networks. Generalized in Yip and Horvath (2005) to higher order interactions.
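
The slide does not reproduce the formula, so the sketch below uses the commonly cited weighted form, TOM(i,j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij is the sum over u of a_iu * a_uj and k_i is the connectivity of node i. Treat this as an assumption about what the slide showed; the function name and toy network are hypothetical.

```python
def topological_overlap(A):
    """Topological overlap matrix for a symmetric adjacency matrix A in [0,1].

    TOM(i,j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    with l_ij = sum over u != i,j of a_iu * a_uj and k_i the row sum of A
    without the diagonal. The diagonal of the result is set to 1.
    """
    n = len(A)
    k = [sum(A[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(A[i][u] * A[u][j] for u in range(n) if u != i and u != j)
            value = (shared + A[i][j]) / (min(k[i], k[j]) + 1 - A[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

def tom_dissimilarity(A):
    """Network distance used for clustering: 1 - TOM."""
    return [[1.0 - t for t in row] for row in topological_overlap(A)]

# Hypothetical unweighted network: edges 0-1, 0-2, 1-2, 2-3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
tom = topological_overlap(A)
```

Nodes 0 and 3 are not directly connected, but they share neighbor 2, so their overlap is positive; this is the "shared friends" intuition from the previous slide.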

34. Using the TOM matrix to cluster genes To group nodes with high topological overlap into modules (clusters), we typically use average linkage hierarchical clustering coupled with the TOM distance measure. Once a dendrogram is obtained from a hierarchical clustering method, we choose a height cutoff to arrive at a clustering. Here modules correspond to branches of the dendrogram.
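
The clustering step can be sketched with a naive average linkage implementation that stops merging once the smallest between-cluster distance exceeds the height cutoff. Because average linkage merge heights never decrease, this gives the same groups as cutting the full dendrogram at that height. The function name, the toy distance matrix, and the cutoff of 0.5 are hypothetical; in practice a library routine would normally replace this cubic-time loop.

```python
def average_linkage_modules(D, height_cutoff):
    """Group items into modules by average linkage with a fixed height cutoff.

    D: symmetric dissimilarity matrix (e.g. 1 - TOM).
    Returns a sorted list of modules, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(D))]

    def avg_dist(c1, c2):
        # Average of all pairwise dissimilarities between the two clusters.
        return sum(D[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg_dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > height_cutoff:
            break  # every remaining merge would lie above the cut
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical dissimilarities: genes 0,1 are close, genes 2,3 are close.
D = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
modules = average_linkage_modules(D, height_cutoff=0.5)
```

Lowering the cutoff yields more, smaller modules; raising it merges branches until a single cluster remains.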

35. More traditional view of module

37. Hub Genes Predict Survival for Brain Cancer Patients. Mischel PS, Zhang B, et al, Horvath S, Nelson SF.

39. Mean Prognostic Significance of Module Genes

40. Module hub genes predict cancer survival A Cox model was used to regress survival on gene expression levels. Prognostic significance was defined as -log10(Cox p-value), which measures the association between each gene and glioblastoma patient survival. A module-based measure of gene connectivity significantly and reproducibly identifies the genes that most strongly predict patient survival.

41. The fact that genes with high intramodular connectivity are more likely to be prognostically significant facilitates a novel screening strategy for finding prognostic genes. Focus on those genes with significant Cox regression p-value AND high intramodular connectivity. It is essential to take a module centric view: focus on the intramodular connectivity of the disease related module. Validation success rate = proportion of genes with independent test set Cox regression p-value<0.05. Validation success rate of the network based screening approach: 68%. Standard approach involving the top 300 most significant genes: 26%.
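
The screening rule and the validation success rate defined on this slide can be written down directly. In the sketch below, all gene names, p-values, and connectivities are made up for illustration, and the connectivity cutoff `k_cutoff` is a hypothetical parameter, since the slides do not say how "high" intramodular connectivity was defined. The Cox p-values are taken as given inputs rather than computed.

```python
import math

def prognostic_significance(cox_p):
    """Prognostic significance as defined on the slides: -log10(Cox p-value)."""
    return -math.log10(cox_p)

def network_based_screen(train_p, intramodular_k, p_cutoff, k_cutoff):
    """Keep genes with a significant training p-value AND high connectivity."""
    return [
        g for g in train_p
        if train_p[g] < p_cutoff and intramodular_k[g] >= k_cutoff
    ]

def validation_success_rate(selected, test_p, alpha=0.05):
    """Proportion of selected genes with independent test set p-value < alpha."""
    if not selected:
        return float("nan")
    return sum(test_p[g] < alpha for g in selected) / len(selected)

# Made-up example values for four genes.
train_p = {"g1": 0.001, "g2": 0.01, "g3": 0.2, "g4": 0.03}
intramodular_k = {"g1": 40, "g2": 5, "g3": 60, "g4": 35}
test_p = {"g1": 0.02, "g2": 0.5, "g3": 0.3, "g4": 0.2}

selected = network_based_screen(train_p, intramodular_k, p_cutoff=0.05, k_cutoff=30)
rate = validation_success_rate(selected, test_p)
```

The standard approach corresponds to dropping the connectivity condition and ranking by training p-value alone, which makes the two success rates directly comparable on the same test set.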

42. Validation success rate of gene expressions in independent data

43. New Application: Tissue Microarray Data

44. Tissue Microarray and DNA Microarray

45. Tissue Array Section An H&E stained slide produced from a tissue array block is depicted in this diagram. When the arrayed pattern is magnified, the individual circular portions of tissue are clearly seen.

46. Ki-67 Expression in Kidney Cancer Histomorphologic illustration of increasing Ki-67 protein staining in high, compared to low grade renal cell carcinomas.

47. Multiple measurements per patient: Several spots per tumor sample and several "scores" per spot

48. Properties of TMA Data Highly skewed, non-normal, semi-continuous. Often a good idea to model as ordinal variables with many levels. Staining scores of the same markers are highly correlated.

50. Frequency plot of the same tumor marker in 2 independent data sets

51. Thresholding methods for tumor marker expressions Since clinicians and pathologists prefer thresholding tumor marker expressions, it is natural to use statistical methods that are based on thresholding covariates, e.g. regression trees, survival trees, rpart, forest predictors etc. Dichotomized marker expressions are often fitted in a Cox (or alternative) regression model Danger: Over-fitting due to optimal cut-off selection. Several thresholding methods and ways for adjusting for multiple comparisons are reviewed in Liu X, Minin V, Huang Y, Seligson DB, Horvath S (2004) Statistical Methods for Analyzing Tissue Microarray Data. J of Biopharmaceutical Statistics. Vol 14(3) 671-685

52. Finding tumor markers for predicting clinical outcomes on the basis of Tissue Microarray Data

53. Using the clustering based strategy for finding tumor markers 1) Find distinct patient clusters without regard to outcome 2) Find whether patient clusters have distinct PSA recurrence profiles 3) If so, find rules (classifiers) for predicting cluster membership 4) Validate those rules in independent data.

55. Cluster Analysis of Low Gleason Score Prostate Samples (UCLA data)

56. 1) Construct a tumor marker rule for predicting RF cluster membership. 2) Validate the rule predictions in an independent data set.

57. Discussion Prostate TMA Data Very weak evidence that individual markers predict PSA recurrence. None of the markers validated individually. However, cluster membership was highly predictive, i.e. the rule could be validated in an independent data set.

58. How to cluster patients on the basis of Tissue Microarray Data?

59. Questions: 1) Can TMA data be used for tumor class discovery, i.e. unsupervised learning? 2) If so, what are suitable unsupervised learning methods?

60. Tumor Class Discovery using DNA Microarray Data Tumor class discovery entails using an unsupervised learning algorithm (e.g. hierarchical, k-means, or SOM clustering) to automatically group tumor samples based on their gene expression pattern.

61. Clusters involving TMA data may have unconventional shapes: Low risk prostate cancer patients are colored in black.

62. Unconventional shape of a clinically meaningful patient cluster 3 dimensional scatter plot along tumor markers. Low risk patients are colored in black.

63. A dissimilarity measure is an essential input for tumor class discovery Dissimilarities between tumor samples are used in clustering and other unsupervised learning techniques. Commonly used dissimilarity measures include Euclidean distance and 1 - correlation.

64. Challenge Conventional dissimilarity measures that work for DNA microarray data may not be optimal for TMA data. Dissimilarity measures that are based on the intuition of multivariate normal distributions (clusters have elliptical shapes) may not be optimal. For tumor marker data, one may want to use a different intuition: clusters are described using thresholding rules involving dependent markers. It may be desirable to have a dissimilarity that is invariant under monotonic transformations of the tumor marker expressions.

65. We have found that a random forest (Breiman 2001) dissimilarity can work well in the unsupervised analysis of TMA data. Shi et al 2004, Seligson et al 2005. http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm

66. Kidney cancer: Comparing PAM clusters that result from using the RF dissimilarity vs the Euclidean distance

67. The RF dissimilarity is determined by dependent tumor markers The RF dissimilarity focuses on the most dependent markers (1,2). In some applications, it is good to focus on markers that are dependent since they may constitute a disease pathway. The Euclidean distance focuses on the most varying marker (4).

68. The RF cluster can be described using a thresholding rule involving the most dependent markers Low risk patient if marker1>cut1 & marker2>cut2. This kind of thresholding rule can be used to make predictions on independent data sets. Validation on independent data set.
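
The thresholding rule on this slide translates into a one-line predicate, which is part of why such rules are easy to carry over to an independent data set. In the sketch below, the cut-off values and patient scores are made up; the slides do not report the actual cut-offs.

```python
def is_low_risk(marker1, marker2, cut1, cut2):
    """Rule from the slide: low risk if marker1 > cut1 and marker2 > cut2."""
    return marker1 > cut1 and marker2 > cut2

def predict_cluster(patients, cut1, cut2):
    """Apply the fixed rule to each (marker1, marker2) pair of a new data set."""
    return [
        "low risk" if is_low_risk(m1, m2, cut1, cut2) else "other"
        for m1, m2 in patients
    ]

# Made-up staining scores for three patients and made-up cut-offs.
independent_patients = [(80, 70), (80, 20), (10, 90)]
labels = predict_cluster(independent_patients, cut1=50, cut2=40)
```

The cut-offs are fixed on the training data and applied unchanged to the independent data; re-optimizing them on the validation set would reintroduce the over-fitting problem discussed earlier.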

69. Theoretical reasons for using an RF dissimilarity for TMA data Main reasons: It is a natural way of weighing tumor marker contributions to the dissimilarity: the more related a tumor marker is to other tumor markers, the more it contributes to the definition of the dissimilarity. There is no need to transform the often highly skewed features, since it is based on feature ranks. It chooses cut-off values automatically, so the resulting clusters can often be described using simple thresholding rules. Other reasons: elegant way to deal with missing covariates; intrinsic proximity matrix handles mixed variable types well. CAVEAT: The choice of the dissimilarity should be determined by the kind of patterns one hopes to find. There will be situations when other dissimilarities are preferable.

70. The random forest dissimilarity. L. Breiman: RF manual. Technical Report: Shi and Horvath 2005. http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm

71. Summary: Random forest clustering Intrinsic variable selection focuses on dependent variables. Depending on the application, this can be attractive. Resulting clusters can often be described using thresholding rules, which is attractive for TMA data. The RF dissimilarity is invariant to monotonic transformations of variables. In some cases, the RF dissimilarity can be approximated using a Euclidean distance of ranked and scaled features. RF clustering was originally suggested by L. Breiman (RF manual). Theoretical properties are studied as part of the dissertation work of Tao Shi. Technical report/code can be found at www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm and www.genetics.ucla.edu/labs/horvath/kidneypaper/RCC.htm
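
The rank-based approximation mentioned in the summary can be sketched as follows. This illustrates only the Euclidean distance of ranked and scaled features, not the random forest dissimilarity itself; the use of average ranks for ties, the scaling of ranks by the number of patients, and all function names and data values are assumptions of this example.

```python
import math

def scaled_ranks(values):
    """Average ranks (ties share a rank), divided by the number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1          # average of 1-based ranks pos+1..end+1
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return [r / len(values) for r in ranks]

def rank_scaled_distance(data):
    """Euclidean distance between patients after rank-scaling each marker.

    data: list of patients, each a list of marker scores.
    """
    n, m = len(data), len(data[0])
    cols = [scaled_ranks([row[c] for row in data]) for c in range(m)]
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((cols[c][i] - cols[c][j]) ** 2 for c in range(m)))
            D[i][j] = D[j][i] = d
    return D

# Made-up, highly skewed marker scores for four patients.
data = [[0, 5], [1, 5], [10, 7], [100, 9]]
D_raw = rank_scaled_distance(data)

# A strictly increasing transformation of marker 1 leaves the distances unchanged.
data_log = [[math.log1p(a), b] for a, b in data]
D_log = rank_scaled_distance(data_log)
```

Because only the ordering of each marker enters the distance, no transformation of the skewed staining scores is needed beforehand, which mirrors the invariance property claimed for the RF dissimilarity.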

72. Conclusions There is a need to identify/develop appropriate data mining methods for TMA data: highly skewed, semi-continuous, non-normal data. Tree or forest based methods work well. ALTERNATIVES?

73. Acknowledgements Former students & postdocs for TMA: Tao Shi, PhD; Xueli Liu, PhD; Yunda Huang, PhD; Tuyen Hoang, PhD. UCLA Tissue Microarray Core: David Seligson, MD; Hyung Kim, MD; Arie Belldegrun, MD; Robert Figlin, MD; Siavash Kurdistani, MD.

74. References RF clustering: Unsupervised learning tasks in TMA data analysis. Review random forest predictors (introduced by L. Breiman). Shi, T. and Horvath, S. (2005) "Unsupervised learning using random forest predictors". Journal of Computational and Graphical Statistics. www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm Application to Tissue Array Data: Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., Horvath, S. (2004) Tumor Profiling of Renal Cell Carcinoma Tissue Microarray Data. Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, Kurdistani S (2005) Global histone modification patterns predict risk of prostate cancer recurrence.
