1. Improved Tumor Marker Validation Success Using Weighted Gene Co-expression Networks and Random Forest Clustering
Steve Horvath
shorvath@mednet.ucla.edu
Human Genetics & Biostatistics
University of California, Los Angeles
2. Contents
Describe pathway-based tumor marker screening strategy
Speculate on the biological reasons why it could work.
Describe 2 empirical success stories for identifying tumor markers that validated in independent data sets
Brain cancer: survival time
(Affy) gene expression microarray data
weighted gene co-expression networks
Prostate cancer: time to PSA recurrence
tissue microarray data (immunohistochemical stainings)
random forest clustering
3. The Embarrassing Validation Problem
A tumor marker is found to be highly predictive of a clinical outcome in one data set but fails to be validated in an independent data set.
“Bad” (analysis) reasons include
data snooping
overfitting
ascertainment issues
“Good” (biological) reasons:
genetic heterogeneity
Little can be done about this.
Single markers don’t capture the essence of the whole disease pathway.
A lot can be done about this: NOVEL STATISTICAL METHODS FOR EXTRACTING SIGNAL FROM THE DATA.
4. Outline of standard strategy for screening for markers
1) Regress a clinical outcome y on the molecular markers (features) X.
2) Identify the features that are most significant or most predictive of the outcome using standard statistical feature selection methods
Empirical finding: often poor validation success.
5. “Pathway Based” Strategy for Screening for Markers
Find suitably defined clusters in the underlying high dimensional feature space X.
Relate the clusters to clinical outcomes of interest. This results in a few “disease clusters” (a.k.a. pathways or modules)
Use features (markers) that describe the states of the disease clusters as final predictors.
(Limited) Empirical Finding: improved validation success
6. Motivating why the pathway based screening strategy may lead to better validation success:
By first clustering the features, one substantially reduces the number of multiple comparisons.
By looking at aggregates of features (clusters) the feature definition is much more robust and more likely to be platform independent.
Combining the features along pathways is the biologically meaningful thing to do.
Pathways are closer to the clinical phenotype than the individual constituents of these pathways.
The whole is more than the sum of its parts…
7. TEASER: Validation success rate of gene expressions in independent data
8. Weighted Gene Co-Expression Network Analysis.
9. Novel statistical approach for analyzing microarray data: weighted network analysis
Empirical evidence that it matters in practice
Identification of Brain Cancer Genes that can be validated in an independent data set
10. Background
Network based methods have been found useful in many domains:
protein interaction networks
the world wide web
social interaction networks
OUR FOCUS: gene co-expression networks
12. Scale free topology is a fundamental property of such networks (Barabasi et al)
It entails the presence of hub nodes that are connected to a large number of other nodes
Such networks are robust with respect to the random deletion of nodes but are sensitive to the targeted attack on hub nodes
It has been demonstrated that metabolic networks exhibit a scale free topology
13. P(k) vs k in scale free networks
14. How to check Scale Free Topology?
15. Gene Co-expression Networks
In gene co-expression networks, each gene corresponds to a node.
Two genes are connected by an edge if their expression values are highly correlated.
Definition of “high” correlation is somewhat tricky
we propose a criterion for picking threshold parameter.
16. Steps for constructing a simple, unweighted co-expression network
17. Our ‘holistic’ view…
18. A general framework for defining weighted gene co-expression networks
Bin Zhang, Steve Horvath
Technical report and R code at www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
19. Beyond the standard approach
Dichotomization allows one to easily define network-based concepts but it eliminates some information regarding the strength of interaction.
To overcome the disadvantage of the dichotomization, we generalize the approach
Measure co-expression by a similarity s(i,j) with range [0,1] e.g. absolute value of the Pearson correlation
Define an adjacency matrix A(i,j)=AF(s(i,j))
The adjacency function AF is a monotonic, non-negative function defined on [0,1] and depends on parameters. The choice of the parameters determines the properties of the network.
We consider 2 types of AFs
Step function AF(s)=I(s>tau) with parameter tau
Power function AF(s)=s^b with parameter b
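The two adjacency functions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function names, the toy similarity matrix, the values tau = 0.7 and b = 6, and the choice to set the diagonal to 0 are all assumptions made for the example.

```python
def step_adjacency(s, tau):
    """Hard threshold AF(s) = I(s > tau): the pair is connected (1) or not (0)."""
    return 1.0 if s > tau else 0.0


def power_adjacency(s, b):
    """Soft threshold AF(s) = s**b: keeps a continuous connection strength."""
    return s ** b


def adjacency_matrix(similarity, af, param):
    """Apply an adjacency function entrywise to a similarity matrix.

    Assumption: the diagonal is set to 0 so a gene is not its own neighbor.
    """
    n = len(similarity)
    return [
        [0.0 if i == j else af(similarity[i][j], param) for j in range(n)]
        for i in range(n)
    ]


# Toy similarity matrix (absolute correlations) for three hypothetical genes.
similarity = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]

unweighted = adjacency_matrix(similarity, step_adjacency, 0.7)  # tau is illustrative
weighted = adjacency_matrix(similarity, power_adjacency, 6)     # b is illustrative
```

The step function discards the difference between a similarity of 0.5 and one of 0.2, while the power function keeps both as small but distinct weights.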
20. Comparing adjacency functions:
21. How to estimate the parameter values of an adjacency function? We propose to use the following criteria:
A) CONSIDER ONLY THOSE PARAMETER VALUES THAT RESULT IN APPROXIMATE SCALE FREE TOPOLOGY
B) SELECT THE PARAMETERS THAT RESULT IN THE HIGHEST MEAN NUMBER OF CONNECTIONS
Criterion A is motivated by the finding that most metabolic networks (including gene co-expression networks, protein-protein interaction networks and cellular networks) have been found to exhibit a scale free topology
Criterion B is motivated by our desire to have high sensitivity to detect modules (clusters of genes) and hub genes.
22. Criterion A is measured by the linear model fitting index R^2
23. Trade-off between criterion A (R^2) and criterion B (mean no. of connections) when varying the power b
24. Empirical insights for determining the adjacency function
For criterion A: measure compliance with scale free topology by using the adjusted R^2 value for the linear regression fit between log(p(k)) and log(k)
Usually require R^2>0.8
For criterion B: aim to get a mean(k)=50 when dealing with 2000 genes.
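A rough sketch of how criterion A might be computed: bin the connectivities, estimate p(k) per bin, and regress log10(p(k)) on log10(k). The equal-width binning, the default of 10 bins, and the function name are assumptions for illustration; the slides do not specify how p(k) is estimated.

```python
import math


def scale_free_fit(connectivities, n_bins=10):
    """Regress log10 p(k) on log10 k; return (slope, R^2, adjusted R^2).

    Assumption: p(k) is estimated from equal-width bins, using the mean
    connectivity in each non-empty bin as the representative k.
    """
    k_min, k_max = min(connectivities), max(connectivities)
    width = (k_max - k_min) / n_bins
    if width == 0:
        raise ValueError("all connectivities are identical")
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for k in connectivities:
        b = min(int((k - k_min) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += k
    total = len(connectivities)
    xs, ys = [], []
    for c, s in zip(counts, sums):
        if c > 0 and s > 0:  # empty bins and zero connectivity have no logarithm
            xs.append(math.log10(s / c))
            ys.append(math.log10(c / total))
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 non-empty bins")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("log-log points have no variation; fit is undefined")
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, r2, adj_r2


# Toy heavy-tailed connectivity sample (made up): many low-k nodes, few hubs.
toy_k = [1] * 512 + [2] * 256 + [4] * 128 + [8] * 64 + [16] * 32 + [32] * 16
slope, r2, adj_r2 = scale_free_fit(toy_k)
mean_k = sum(toy_k) / len(toy_k)  # criterion B looks at the mean connectivity
```

In a parameter search one would compute these quantities for each candidate tau or b, keep the candidates whose fit index exceeds the chosen bar (for example 0.8), and among those prefer the one with the highest mean connectivity.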
25. Trade-off between criterion A and B when varying tau
26. Mathematical Definition of an Undirected Network
27. Network=Adjacency Matrix
A network can be represented by an adjacency matrix, A=[a_ij], that encodes whether/how a pair of nodes is connected.
A is a symmetric matrix with entries in [0,1].
For unweighted network, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected).
For weighted networks, the adjacency matrix reports the connection strength between gene pairs.
28. Generalized Connectivity
Gene connectivity corresponds to the row sums of the adjacency matrix
For unweighted networks=number of direct neighbors
For weighted networks= sum of connection strengths to other nodes
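As a small illustration of the row-sum definition, the sketch below computes connectivity for one unweighted and one weighted toy network. The matrices and the function name are made up for the example; the diagonal is skipped so a gene does not count as its own neighbor.

```python
def connectivity(adjacency):
    """Row sums of the adjacency matrix, excluding the diagonal entry."""
    return [
        sum(value for j, value in enumerate(row) if j != i)
        for i, row in enumerate(adjacency)
    ]


# Unweighted toy network: connectivity counts direct neighbors.
unweighted = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]

# Weighted toy network: connectivity sums connection strengths.
weighted = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]

k_unweighted = connectivity(unweighted)
k_weighted = connectivity(weighted)
```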
29. Network Analysis Flow Chart
31. Network Distance Measure: Topological Overlap Matrix
32. How to measure distance in a network?
Mathematical Answer: Geodesics
length of shortest path connecting 2 nodes
we have found no empirical evidence that this is a biologically meaningful concept in co-expression networks
Biological Answer: look at shared neighbors
Intuition: if 2 people share the same friends they are close in a social network
Use the topological overlap measure based distance proposed by Ravasz et al (2002, Science)
33. Topological Overlap (Ravasz et al) leads to a network distance measure
Generalized in Zhang and Horvath (2005) to the case of weighted networks
Generalized in Yip and Horvath (2005) to higher order interactions
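The slides do not spell out the formula, so the sketch below uses the form commonly given for the weighted generalization: TOM(i,j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij sums a_iu * a_uj over shared neighbors u and k_i is the connectivity of node i. Treat the formula as recalled from the cited work and check it against the technical report; the function names and the toy network are assumptions.

```python
def topological_overlap(adjacency):
    """Topological overlap matrix for a symmetric adjacency matrix in [0, 1]."""
    n = len(adjacency)
    k = [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]  # a node fully overlaps with itself
    for i in range(n):
        for j in range(i + 1, n):
            a_ij = adjacency[i][j]
            # Strength of the connections that i and j share through other nodes.
            shared = sum(
                adjacency[i][u] * adjacency[u][j]
                for u in range(n)
                if u != i and u != j
            )
            value = (shared + a_ij) / (min(k[i], k[j]) + 1.0 - a_ij)
            tom[i][j] = tom[j][i] = value
    return tom


def tom_dissimilarity(adjacency):
    """Network distance: 1 - topological overlap."""
    return [[1.0 - w for w in row] for row in topological_overlap(adjacency)]


# Toy unweighted network: nodes 0 and 1 are linked and share neighbors 2 and 3.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
tom = topological_overlap(adjacency)
dist = tom_dissimilarity(adjacency)
```

Nodes 2 and 3 are not directly connected, yet they receive a high overlap because they share both of their neighbors, which is the "shared friends" intuition from the previous slide.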
34. Using the TOM matrix to cluster genes
To group nodes with high topological overlap into modules (clusters), we typically use average linkage hierarchical clustering coupled with the TOM distance measure.
Once a dendrogram is obtained from a hierarchical clustering method, we choose a height cutoff to arrive at a clustering.
Here modules correspond to branches of the dendrogram
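The clustering step can be sketched with a plain average linkage procedure that stops merging once the closest pair of clusters is farther apart than the height cutoff, which is equivalent to cutting the average linkage dendrogram at that height. This is a naive illustration for small inputs, assuming a fixed cutoff; the distance matrix and cutoff value are made up.

```python
def average_linkage_modules(dist, height_cutoff):
    """Agglomerative average linkage clustering cut at a fixed height.

    dist is a symmetric dissimilarity matrix (for example 1 - TOM).
    Returns a list of modules, each a sorted list of node indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (average distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > height_cutoff:
            break  # every remaining merge would happen above the cut height
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy dissimilarity matrix with two tight pairs of nodes.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
modules = average_linkage_modules(dist, height_cutoff=0.5)
```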
35. More traditional view of module
37. Hub Genes Predict Survival for Brain Cancer Patients
Mischel PS, Zhang B, et al, Horvath S, Nelson SF.
39. Mean Prognostic Significance of Module Genes
40. Module hub genes predict cancer survival
Cox model to regress survival on gene expression levels
Prognostic significance of a gene is defined as -log10(Cox p-value), which measures the association between the gene's expression and glioblastoma patient survival
A module-based measure of gene connectivity significantly and reproducibly identifies the genes that most strongly predict patient survival
41. The fact that genes with high intramodular connectivity are more likely to be prognostically significant facilitates a novel screening strategy for finding prognostic genes
Focus on those genes with significant Cox regression p-value AND high intramodular connectivity.
It is essential to take a module centric view: focus on intramodular connectivity of disease related module
Validation success rate= proportion of genes with independent test set Cox regression p-value<0.05.
Validation success rate of network based screening approach (68%)
Standard approach involving top 300 most significant genes: 26%
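The screening rule and the validation success rate can be written down directly. In the sketch below the gene names, p-values, connectivities, and cutoffs are made-up toy values, and the function names are hypothetical; only the definitions (prognostic significance as -log10 of the Cox p-value, and success rate as the proportion of selected genes with test set p < 0.05) come from the slides.

```python
import math


def prognostic_significance(cox_p_value):
    """Prognostic significance of a gene: -log10 of its Cox regression p-value."""
    return -math.log10(cox_p_value)


def network_screen(train_p, intramodular_k, p_cutoff, k_cutoff):
    """Keep genes with a significant training p-value AND high intramodular connectivity."""
    return sorted(
        g for g in train_p
        if train_p[g] < p_cutoff and intramodular_k[g] >= k_cutoff
    )


def validation_success_rate(genes, test_p, alpha=0.05):
    """Proportion of selected genes whose independent test set p-value is below alpha."""
    if not genes:
        raise ValueError("no genes selected")
    return sum(1 for g in genes if test_p[g] < alpha) / len(genes)


# Toy values for four hypothetical genes (not real data).
train_p = {"g1": 0.001, "g2": 0.004, "g3": 0.03, "g4": 0.20}
intramodular_k = {"g1": 42.0, "g2": 5.0, "g3": 37.0, "g4": 50.0}
test_p = {"g1": 0.01, "g2": 0.40, "g3": 0.04, "g4": 0.60}

selected = network_screen(train_p, intramodular_k, p_cutoff=0.05, k_cutoff=30.0)
rate = validation_success_rate(selected, test_p)
```

Gene g2 is significant in the training data but poorly connected, and g4 is well connected but not significant, so the combined rule drops both.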
42. Validation success rate of gene expressions in independent data
43. New Application: Tissue Microarray Data
44. Tissue Microarray / DNA Microarray
45. Tissue Array Section
An H&E stained slide produced from a tissue array block is depicted in this diagram. When the arrayed pattern is magnified, the individual circular portions of tissue are clearly seen.
46. Ki-67 Expression in Kidney Cancer
Histomorphologic illustration of increasing Ki-67 protein staining in high, compared to low grade renal cell carcinomas.
47. Multiple measurements per patient: Several spots per tumor sample and several “scores” per spot
48. Properties of TMA Data
Highly skewed, non-normal, semi-continuous.
Often a good idea to model as ordinal variables with many levels.
Staining scores of the same markers are highly correlated
50. Frequency plot of the same tumor marker in 2 independent data sets
51. Thresholding methods for tumor marker expressions
Since clinicians and pathologists prefer thresholding tumor marker expressions, it is natural to use statistical methods that are based on thresholding covariates, e.g. regression trees, survival trees, rpart, forest predictors etc.
Dichotomized marker expressions are often fitted in a Cox (or alternative) regression model
Danger: Over-fitting due to optimal cut-off selection.
Several thresholding methods and ways for adjusting for multiple comparisons are reviewed in
Liu X, Minin V, Huang Y, Seligson DB, Horvath S (2004) Statistical Methods for Analyzing Tissue Microarray Data. J of Biopharmaceutical Statistics. Vol 14(3) 671-685
52. Finding tumor markers for predicting clinical outcomes on the basis of Tissue Microarray Data
53. Using the clustering based strategy for finding tumor markers
1) Find distinct patient clusters without regard to outcome
2) Find whether patient clusters have distinct PSA recurrence profiles
3) If so, find rules (classifiers) for predicting cluster membership
4) Validate those rules in independent data.
55. Cluster Analysis of Low Gleason Score Prostate Samples (UCLA data)
56. 1) Construct a tumor marker rule for predicting RF cluster membership.
2) Validate the rule predictions in an independent data set
57. Discussion Prostate TMA Data
Very weak evidence that individual markers predict PSA recurrence
None of the markers validated individually
However, cluster membership was highly predictive, i.e., the rule could be validated in an independent data set.
58. How to cluster patients on the basis of Tissue Microarray Data?
59. Questions:
1) Can TMA data be used for tumor class discovery, i.e., unsupervised learning?
2) If so, what are suitable unsupervised learning methods?
60. Tumor Class Discovery using DNA Microarray Data
Tumor class discovery entails using an unsupervised learning algorithm (e.g. hierarchical, k-means, SOM clustering etc.) to automatically group tumor samples based on their gene expression pattern.
61. Clusters involving TMA data may have unconventional shapes: Low risk prostate cancer patients are colored in black.
62. Unconventional shape of a clinically meaningful patient cluster
3 dimensional scatter plot along tumor markers
Low risk patients are colored in black
63. A dissimilarity measure is an essential input for tumor class discovery
Dissimilarities between tumor samples are used in clustering and other unsupervised learning techniques
Commonly used dissimilarity measures include Euclidean distance, 1 - correlation
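For reference, the two conventional dissimilarities named above can be computed as follows. This is a plain standard-library sketch with hypothetical function names; each sample is a list of marker or gene expression values.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation between two samples of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_dissimilarity(x, y):
    """1 - correlation: 0 for perfectly correlated profiles, 2 for opposite ones."""
    return 1.0 - pearson_correlation(x, y)
```

Euclidean distance is sensitive to the scale of each feature, whereas 1 - correlation only compares the shape of the two profiles.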
64. Challenge Conventional dissimilarity measures that work for DNA microarray data may not be optimal for TMA data.
Dissimilarity measures that are based on the intuition of multivariate normal distributions (clusters have elliptical shapes) may not be optimal
For tumor marker data, one may want to use a different intuition: clusters are described using thresholding rules involving dependent markers.
It may be desirable to have a dissimilarity that is invariant under monotonic transformations of the tumor marker expressions.
65. We have found that a random forest (Breiman 2001) dissimilarity can work well in the unsupervised analysis of TMA data.
Shi et al 2004, Seligson et al 2005.
http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
66. Kidney cancer: Comparing PAM clusters that result from using the RF dissimilarity vs the Euclidean distance
67. The RF dissimilarity is determined by dependent tumor markers
The RF dissimilarity focuses on the most dependent markers (1,2).
In some applications, it is good to focus on markers that are dependent since they may constitute a disease pathway.
The Euclidean distance focuses on the most varying marker (4)
68. The RF cluster can be described using a thresholding rule involving the most dependent markers
Low risk patient if marker1>cut1 & marker2>cut2
This kind of thresholding rule can be used to make predictions on independent data sets.
Validation on independent data set
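A thresholding rule of this form is simple to apply to new patients, which is what makes it easy to carry over to an independent data set. In the sketch below the cut-off values, the patient records, and the function names are hypothetical; in practice the cut-offs would be fixed on the training data before looking at the validation data.

```python
def is_low_risk(marker1, marker2, cut1, cut2):
    """Rule from the slide: low risk if marker1 > cut1 and marker2 > cut2."""
    return marker1 > cut1 and marker2 > cut2


def classify_patients(patients, cut1, cut2):
    """Label each patient record using the fixed thresholding rule."""
    return [
        "low risk" if is_low_risk(p["marker1"], p["marker2"], cut1, cut2) else "other"
        for p in patients
    ]


# Hypothetical staining scores for three patients in an independent data set.
validation_patients = [
    {"marker1": 80.0, "marker2": 65.0},
    {"marker1": 80.0, "marker2": 10.0},
    {"marker1": 5.0, "marker2": 90.0},
]
labels = classify_patients(validation_patients, cut1=50.0, cut2=40.0)
```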
69. Theoretical reasons for using an RF dissimilarity for TMA data
Main reasons
natural way of weighing tumor marker contributions to the dissimilarity
The more related a tumor marker is to other tumor markers the more it contributes to the definition of the dissimilarity
no need to transform the often highly skewed features
based on feature ranks
Chooses cut-off values automatically
resulting clusters can often be described using simple thresholding rules
Other reasons
elegant way to deal with missing covariates
intrinsic proximity matrix handles mixed variable types well
CAVEAT: The choice of the dissimilarity should be determined by the kind of patterns one hopes to find. There will be situations when other dissimilarities are preferable.
70. The random forest dissimilarity
L. Breiman: RF manual
Technical Report: Shi and Horvath 2005
http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
71. Summary: Random forest clustering
Intrinsic variable selection focuses on dependent variables
Depending on the application, this can be attractive
Resulting clusters can often be described using thresholding rules, which is attractive for TMA data.
RF dissimilarity invariant to monotonic transformations of variables
In some cases, the RF dissimilarity can be approximated using a Euclidean distance of ranked and scaled features.
RF clustering was originally suggested by L. Breiman (RF manual). Theoretical properties are studied as part of the dissertation work of Tao Shi. Technical report/code can be found at www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm www.genetics.ucla.edu/labs/horvath/kidneypaper/RCC.htm
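The sketch below illustrates the rank-based approximation mentioned in the summary and why it is invariant under monotonic transformations. It is not the RF dissimilarity itself: it only computes a Euclidean distance on ranked and scaled features, with tie handling (average ranks) and scaling (division by the number of patients) chosen as assumptions for the example.

```python
import math


def scaled_ranks(values):
    """Average ranks (ties share a rank), divided by n so they lie in (0, 1]."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = average_rank
        pos = end + 1
    return [r / n for r in ranks]


def rank_distance_matrix(data):
    """Euclidean distances between patients after ranking each marker.

    data is a list of patients, each a list of marker values.
    """
    columns = [scaled_ranks(list(col)) for col in zip(*data)]
    rows = list(zip(*columns))
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2))) for r2 in rows]
        for r1 in rows
    ]


# Toy data: three patients, two skewed markers (made-up values).
data = [[0.0, 10.0], [1.0, 100.0], [5.0, 1000.0]]
# Strictly increasing transformations of each marker leave the ranks unchanged.
transformed = [[math.exp(a), math.log10(b)] for a, b in data]

d_original = rank_distance_matrix(data)
d_transformed = rank_distance_matrix(transformed)
```

Because only the ordering of each marker enters the calculation, skewed staining scores need no prior transformation, which mirrors the invariance property claimed for the RF dissimilarity.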
72. Conclusions
There is a need to identify/develop appropriate data mining methods for TMA data
highly skewed, semi-continuous, non-normal data
tree or forest based methods work well
ALTERNATIVES?
73. Acknowledgements
Former students & Postdocs for TMA
Tao Shi PhD
Xueli Liu PhD
Yunda Huang PhD
Tuyen Hoang PhD
UCLA Tissue Microarray Core
David Seligson, MD
Hyung Kim, MD
Arie Belldegrun, MD
Robert Figlin, MD
Siavash Kurdistani, MD
74. References: RF clustering
Unsupervised learning tasks in TMA data analysis
Review random forest predictors (introduced by L. Breiman)
Shi, T. and Horvath, S. (2005) “Unsupervised learning using random forest predictors” Journal of Computational and Graphical Statistics
www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
Application to Tissue Array Data
Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., Horvath, S. (2004) Tumor Profiling of Renal Cell Carcinoma Tissue Microarray Data
Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, Kurdistani S (2005) Global histone modification patterns predict risk of prostate cancer recurrence.