Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. T.R. Golub et al., Science 286, 531 (1999).
Introduction • Why is identification of cancer class (tumor subtype) important? • Cancers of identical grade can have widely variable clinical courses (e.g., acute lymphoblastic leukemia or acute myeloid leukemia). • Traditional methods: • Morphological appearance. • Enzyme-based histochemical analyses. • Immunophenotyping. • Cytogenetic analysis.
Topics of Discussion • Class Prediction (supervised learning). • Class Discovery (unsupervised learning).
Class Prediction • How could one use an initial collection of samples belonging to known classes to create a class predictor? • Identification of informative genes via neighborhood analysis. • Weighted vote.
Neighborhood Analysis • Why do we want to start with informative genes? • To be readily applied in a clinical setting. • Highly instructive
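Neighborhood Analysis • v(g) = (e1, e2, ..., en): expression levels of gene g across the n samples. • c = (c1, c2, ..., cn): idealized expression pattern, with ci = 1 or 0 according to whether sample i belongs to class 1 or class 2. • Compute the correlation between v(g) and c. Possible measures: • Euclidean distance. • Pearson correlation coefficient. • P(g,c) = [µ1(g) - µ2(g)]/[σ1(g) + σ2(g)], where µk(g) and σk(g) are the mean and standard deviation of the expression levels of gene g in class k.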
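The score P(g,c) above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `signal_to_noise` and `rank_genes`, the toy expression matrix, the use of the sample standard deviation, and ranking by absolute score are all assumptions made for the example.

```python
import statistics

def signal_to_noise(expression, labels):
    """P(g,c) for one gene: (mu1 - mu2) / (sigma1 + sigma2).

    expression: expression levels of the gene, one per sample.
    labels: class of each sample, 1 or 2 (same order as expression).
    """
    class1 = [e for e, c in zip(expression, labels) if c == 1]
    class2 = [e for e, c in zip(expression, labels) if c == 2]
    # Assumption: sample standard deviation; needs >= 2 samples per class.
    spread = statistics.stdev(class1) + statistics.stdev(class2)
    if spread == 0:
        raise ValueError("gene is constant within both classes")
    return (statistics.mean(class1) - statistics.mean(class2)) / spread

def rank_genes(matrix, labels):
    """Return gene indices sorted by |P(g,c)|, most informative first."""
    scores = [signal_to_noise(row, labels) for row in matrix]
    return sorted(range(len(matrix)), key=lambda g: abs(scores[g]), reverse=True)

# Hypothetical toy data: 3 genes (rows) measured in 6 samples (columns).
labels = [1, 1, 1, 2, 2, 2]
matrix = [
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],   # high in class 1
    [4.0, 5.0, 6.0, 4.5, 5.5, 6.5],   # nearly uninformative
    [1.0, 1.5, 2.0, 8.0, 8.5, 9.0],   # high in class 2
]
print(signal_to_noise(matrix[0], labels))  # 2.0
print(rank_genes(matrix, labels))          # [2, 0, 1]
```

A large positive score marks a gene that is high in class 1, a large negative score a gene that is high in class 2, and a score near zero a gene that carries little class information.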
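Class Predictor via Gene Voting • Parameters (ag, bg) are defined for each informative gene. • ag = P(g,c) • bg = [µ1(g) + µ2(g)]/2 • vg = ag(xg - bg), where xg is the expression level of gene g in the sample to be classified. • V1 = ∑ |vg| for vg > 0 (votes for class 1). • V2 = ∑ |vg| for vg < 0 (votes for class 2). • PS = (Vwin - Vlose)/(Vwin + Vlose), the prediction strength. • The sample is assigned to the winning class if PS > threshold; otherwise the prediction is uncertain.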
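The voting rule above can be sketched as follows. The function name `weighted_vote`, the parameter values, and the threshold of 0.3 used in the example calls are illustrative assumptions; the slide itself only says "PS > threshold".

```python
def weighted_vote(sample, params, threshold):
    """Classify one sample by gene voting.

    sample: expression levels x_g of the informative genes.
    params: list of (a_g, b_g) pairs, one per informative gene.
    threshold: minimum prediction strength (assumed non-negative).
    Returns (predicted class, or None if uncertain, and the prediction strength).
    """
    v1 = v2 = 0.0
    for x, (a, b) in zip(sample, params):
        v = a * (x - b)
        if v > 0:
            v1 += v        # vote for class 1
        elif v < 0:
            v2 += -v       # vote for class 2
    total = v1 + v2
    if total == 0:
        return None, 0.0   # no gene cast a vote
    v_win, v_lose = max(v1, v2), min(v1, v2)
    ps = (v_win - v_lose) / total
    winner = 1 if v1 > v2 else 2
    return (winner if ps > threshold else None), ps

# Hypothetical parameters for two informative genes.
params = [(2.0, 4.0), (-1.0, 2.0)]
print(weighted_vote([6.0, 1.0], params, 0.3))  # (1, 1.0)
print(weighted_vote([5.0, 4.0], params, 0.3))  # (None, 0.0)
```

In the first call both genes vote for class 1, so PS = 1. In the second call the two votes cancel exactly, PS = 0, and the sample is left uncertain.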
Data • Initial samples: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis. • Independent samples: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Validation of Gene Voting • Initial samples: 36 of the 38 samples were assigned to either AML or ALL, and two were called uncertain. All 36 predictions agree with the clinical diagnosis. • Independent samples: 29 of the 34 samples received strong predictions, all of them correct.
Class Discovery • Can cancer classes be discovered automatically based on gene expression? • Cluster tumors by gene expression • Determine whether the putative classes produced are meaningful.
Cluster tumors • Self-Organizing Map (SOM) • Mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the k-means approach). • Choose a geometry of “nodes”. • Nodes are mapped into k-dimensional space, initially at random. • Iteratively adjust the nodes.
Adjusting the nodes • Randomly select a data point P. • Move the nodes in the direction of P. • The closest node Np is moved the most. • Other nodes are moved depending on their distance from Np in the initial geometry.
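A single adjustment step of this kind can be sketched as below. The slide does not say how the amount of movement falls off with distance in the node geometry; the Gaussian falloff, the `rate` and `radius` parameters, and the function name `som_step` are assumptions chosen for illustration.

```python
import math

def som_step(nodes, grid, point, rate, radius):
    """One SOM update: move every node toward `point`.

    nodes: node positions in data space (list of lists of floats).
    grid: fixed coordinates of each node in the chosen geometry.
    rate: step fraction applied to the closest node.
    radius: width of the (assumed Gaussian) neighborhood on the grid.
    Returns (index of the closest node, updated node positions).
    """
    # Closest node N_p, measured in data space.
    winner = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], point))
    updated = []
    for pos, coord in zip(nodes, grid):
        # Distance from N_p in the node geometry, not in data space.
        d = math.dist(coord, grid[winner])
        strength = rate * math.exp(-(d * d) / (2 * radius * radius))
        updated.append([x + strength * (p - x) for x, p in zip(pos, point)])
    return winner, updated

# Hypothetical example: two nodes on a 2 x 1 grid, one data point.
nodes = [[0.0, 0.0], [10.0, 10.0]]
grid = [(0, 0), (1, 0)]
winner, moved = som_step(nodes, grid, [1.0, 1.0], rate=0.5, radius=1.0)
print(winner)    # 0
print(moved[0])  # [0.5, 0.5]
```

A full run would repeat this step many times with randomly chosen data points, typically shrinking `rate` and `radius` as the iterations proceed.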
Validation of SOM • Two-cluster SOM (clusters A1 and A2) applied to the initial dataset: • 24 of the 25 samples in cluster A1 were ALL. • 10 of the 13 samples in cluster A2 were AML.
Validation of SOM • How could one evaluate the putative clusters if the “right” answer were not known? • Assumption: class discovery could be tested by class prediction. • Testing of the assumption: • Construct predictors based on clusters A1 and A2. • Construct predictors based on random clusters.
Validation of SOM • Predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertain calls.
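The idea of testing putative clusters by class prediction can be sketched as a comparison between predictors built on the cluster labels and predictors built on shuffled labels. This is a rough illustration under several assumptions not stated on the slides: leave-one-out evaluation, voting with all genes rather than a selected subset, no uncertainty threshold, and hypothetical toy data and function names.

```python
import random
import statistics

def train(samples, labels):
    """Return (a_g, b_g) for every gene from labeled samples (labels 1 or 2)."""
    params = []
    for g in range(len(samples[0])):
        c1 = [s[g] for s, l in zip(samples, labels) if l == 1]
        c2 = [s[g] for s, l in zip(samples, labels) if l == 2]
        m1, m2 = statistics.mean(c1), statistics.mean(c2)
        spread = statistics.stdev(c1) + statistics.stdev(c2)
        a = (m1 - m2) / spread if spread > 0 else 0.0  # constant gene abstains
        params.append((a, (m1 + m2) / 2))
    return params

def vote(sample, params):
    """Weighted vote without a threshold; ties go to class 1."""
    v1 = v2 = 0.0
    for x, (a, b) in zip(sample, params):
        v = a * (x - b)
        if v > 0:
            v1 += v
        else:
            v2 -= v
    return 1 if v1 >= v2 else 2

def loo_accuracy(samples, labels):
    """Fraction of samples predicted correctly when each is held out in turn."""
    hits = 0
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if vote(samples[i], train(rest, rest_labels)) == labels[i]:
            hits += 1
    return hits / len(samples)

# Hypothetical toy data: 8 samples, 2 genes, two well-separated clusters.
samples = [[9.0, 1.0], [10.0, 2.0], [11.0, 1.5], [9.5, 2.5],
           [1.0, 9.0], [2.0, 10.0], [1.5, 11.0], [2.5, 9.5]]
cluster_labels = [1, 1, 1, 1, 2, 2, 2, 2]

random_labels = cluster_labels[:]
random.Random(0).shuffle(random_labels)   # same class sizes, random membership

print(loo_accuracy(samples, cluster_labels))  # 1.0
print(loo_accuracy(samples, random_labels))
```

If the clusters reflect real structure in the data, predictors built on them should clearly outperform predictors built on random groupings of the same sizes.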
Searching for Finer Classes • Use SOM to divide the initial samples into four clusters (denoted B1 to B4). • B1 corresponds to AML, B2 to T-lineage ALL, and B3 and B4 to B-lineage ALL.