Classification and Clustering Methods for Gene Expression
Kim-Anh Do, Ph.D.
Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center
Advanced Statistical Methods for the Analysis of Gene Expression and Proteomics (GSBS010103 and STAT675)
Part of these lecture notes are adapted from Hilsenbeck's and Goldstein's lecture notes.
Tumor Classification Using Gene Expression Data
Three main types of statistical problems associated with microarray data:
• Identification of "marker" genes that characterize the different tumor classes (feature or variable selection).
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering).
• Classification of samples into known classes (supervised learning – classification).
Hierarchical Clustering
1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same similarity, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.
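A minimal sketch of this agglomerative procedure in R using the built-in hclust function; the expression matrix expr (genes in rows, arrays in columns), the Euclidean distance, and average linkage are illustrative assumptions, not values from the lecture.

```r
# Hypothetical expression matrix: 100 genes x 10 arrays
set.seed(1)
expr <- matrix(rnorm(1000), nrow = 100,
               dimnames = list(paste0("G", 1:100), paste0("Array", 1:10)))

d  <- dist(expr, method = "euclidean")   # step 1: pairwise distances between genes
hc <- hclust(d, method = "average")      # steps 2-3: repeatedly fuse the closest clusters
plot(hc, labels = FALSE)                 # step 4: draw the tree (dendrogram)
groups <- cutree(hc, k = 4)              # optionally cut the tree into 4 gene clusters
```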
Agglomerative Linkage Methods
• Linkage methods are rules or metrics that return a value used to determine which elements (clusters) should be linked.
• Three commonly used linkage methods are:
  • Single linkage
  • Average linkage
  • Complete linkage
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create 'elongated' clusters with individual genes chained onto clusters.
D_AB = min d(u_i, v_j), where u_i ∈ A and v_j ∈ B, for i = 1 to N_A and j = 1 to N_B
Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
D_AB = (1 / (N_A N_B)) Σ_i Σ_j d(u_i, v_j), where u_i ∈ A and v_j ∈ B, for i = 1 to N_A and j = 1 to N_B
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
D_AB = max d(u_i, v_j), where u_i ∈ A and v_j ∈ B, for i = 1 to N_A and j = 1 to N_B
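The three linkage distances can be computed directly from a cross-distance matrix; a small base-R illustration, reusing the hypothetical expr matrix from the sketch above and two arbitrary groups of genes as clusters A and B.

```r
A <- expr[1:5, ]                     # members of cluster A
B <- expr[6:12, ]                    # members of cluster B
cross <- as.matrix(dist(rbind(A, B)))[1:nrow(A), (nrow(A) + 1):(nrow(A) + nrow(B))]

single   <- min(cross)               # D_AB under single linkage
average  <- mean(cross)              # D_AB under average linkage
complete <- max(cross)               # D_AB under complete linkage
```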
K-Means/Medians Clustering
1. Specify the number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
K-Means/Medians Clustering
K-means is most useful when the user has an a priori specified number of clusters for the genes.
1. Specify the number of clusters.
2. Randomly assign genes to clusters.
3. Calculate the mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile is closest to that gene's expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
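A short k-means sketch on the hypothetical expr matrix. Base R's kmeans uses Hartigan-Wong style iterations rather than the literal gene-shuffling described above, and a medians-based analogue would typically be done with a medoid method such as cluster::pam; the number of clusters and iterations here are illustrative.

```r
set.seed(2)
km <- kmeans(expr, centers = 5, iter.max = 50, nstart = 10)
km$cluster[1:10]    # cluster membership of the first 10 genes
km$centers          # mean expression profile of each cluster
```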
Self-Organizing Maps (SOMs)
1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.
SOMs
2. Choose a random gene, say, G9.
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller, varying amounts. The farther away a node is from N2, the less it is moved.
SOMs
4. Repeat steps 2 and 3 several thousand times; with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will "nestle" among a cluster of genes, and a gene will be considered to be in a cluster if its distance to that cluster's node is less than its distance to any other node.
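A brief sketch of fitting a SOM to the gene profiles, assuming the third-party kohonen package is available; the 3×2 hexagonal grid and the number of training iterations are illustrative choices, not values from the lecture.

```r
# install.packages("kohonen")   # assumed available
library(kohonen)

set.seed(3)
grid <- somgrid(xdim = 3, ydim = 2, topo = "hexagonal")   # 6 nodes in a 2-D layout
fit  <- som(as.matrix(expr), grid = grid, rlen = 2000)    # iterative node updates

table(fit$unit.classif)   # how many genes ended up at each node (cluster sizes)
```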
Principal Components Analysis
• PCA is a dimension reduction technique.
• Suppose we have measurements for each gene on multiple experiments.
• Suppose some of the experiments are correlated.
• PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
Principal Components Analysis
A "cloud" of data points (e.g., genes) in N-dimensional space (N = # hybridizations) is summarized along 3 principal component axes. Example:
• The x-axis could represent a continuum from over- to under-expression.
• The y-axis could mean that "blue" genes are over-expressed in the first five experiments and under-expressed in the remaining experiments, while "brown" genes are under-expressed in the first five experiments and over-expressed in the remaining experiments.
• The z-axis might represent different cyclic patterns, e.g., "red" genes might be over-expressed in odd-numbered experiments and under-expressed in even-numbered ones, whereas the opposite is true for "purple" genes.
Interpretation of the components is somewhat subjective.
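A minimal PCA sketch on the hypothetical expr matrix, where each gene is a point in N-dimensional space (N = number of hybridizations) and is projected onto a few principal component axes; centering and scaling are illustrative choices.

```r
pc <- prcomp(expr, center = TRUE, scale. = TRUE)
summary(pc)                          # proportion of variance explained per component
plot(pc$x[, 1], pc$x[, 2],           # genes plotted on the first two principal axes
     xlab = "PC1", ylab = "PC2")
```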
Other developments 1. Supervised principal components (Bair, Hastie, Paul, Tibshirani, JASA 2005) Software: http://www-stat.stanford.edu/~tibs/superpc/ 2. Graphical networks, Bayesian networks, relevance networks,…
Relevance Networks
• The expression pattern of each gene is compared to that of every other gene.
• The ability of each gene to predict the expression of every other gene is assigned a correlation coefficient.
• Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds (e.g., T_min = 0.50, T_max = 0.90) are eliminated.
• The remaining relationships between genes define the subnets.
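A small sketch of a relevance-network style analysis on the hypothetical expr matrix: correlate every gene with every other gene, then keep only the pairs whose absolute correlation falls between the two thresholds. The thresholds are taken from the example above and the absolute-value convention is an assumption.

```r
cc <- cor(t(expr))                         # gene-by-gene correlation matrix
t_min <- 0.50; t_max <- 0.90               # minimum and maximum thresholds
keep <- abs(cc) >= t_min & abs(cc) <= t_max
diag(keep) <- FALSE                        # ignore each gene's correlation with itself
edges <- which(keep, arr.ind = TRUE)       # remaining gene-gene links define the subnets
```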
Methods
• Discriminant analysis
  • Linear and quadratic discriminant analysis
  • Diagonal LDA, QDA – ignores correlations between genes, reducing the number of parameters to estimate
• Weighted voting
• Compound covariate predictor
• Linear support vector machines (SVM)
• Nearest (shrunken) centroids (PAM)
• Classification and regression trees
• K nearest neighbors
• Others: nonlinear SVM, neural networks, etc.
Dudoit, S. and J. Fridlyand (2003). Classification in microarray experiments. In: Statistical Analysis of Gene Expression Microarray Data (T. P. Speed, ed.). New York, Chapman & Hall/CRC: 222.
Simon, R., L. M. McShane, et al. (2003). Class prediction. In: Design and Analysis of DNA Microarray Investigations. New York, Springer: 199.
Classification
• Each object (e.g., array or column) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).
• Aim: predict Y_new from X_new.

Gene    sample1  sample2  sample3  sample4  sample5   …   New sample
Y:      Normal   Normal   Normal   Cancer   Cancer    …   unknown (= Y_new)
1        0.46     0.30     0.80     1.51     0.90    ...   0.34
2       -0.10     0.49     0.24     0.06     0.46    ...   0.43
3        0.15     0.74     0.04     0.10     0.20    ...  -0.23
4       -0.45    -1.03    -0.79    -0.56    -0.32    ...  -0.91
5       -0.06     1.06     1.35     1.09    -1.09    ...   1.23

The columns sample1–sample5 form X; the new sample's expression profile is X_new.
Classification and Prediction: Supervised Learning
• Start with cases in known classes.
• Select "features" that characterize the classes.
• Make a rule that lets you use the values of the "features" to assign a class (minimize the "cost" of misclassification).
• Apply the rule.
• Estimate the accuracy of classification (XX% CI) and/or test to compare to an existing classifier.
• Classification and prediction is an old area of statistics (LDA, Fisher, 1936), but still an active area of research.
• Traditionally, there are more cases than features.
• With expression arrays, there are usually 100× more features than cases.
Selecting a classification method • Best classifier is the one that makes the fewest mistakes (or least costly mistakes) • How should we estimate the accuracy of the classifier?
Resubstitution Estimate of Accuracy
Training cases → select features → make classifier → use the classifier on the same training cases → class predictions → report the AMAZING result!!!
This is invariably biased, reporting a much more favorable accuracy than is true.
External Validation Estimate of Accuracy
Training cases → select features → make classifier; independent validation cases → use classifier → class predictions → report the less amazing but more generalizable result!!!
This is unbiased, but may not be feasible on your budget.
Inference: the XX% CI excludes the null rate, or an explicit comparison to a competing method.
Cross-validation Estimate of Accuracy
Split the training cases into N−k "leave-in" cases and k "leave-out" cases; select features and make the classifier on the leave-in cases only, apply it to the leave-out cases to get class predictions, and repeat.
This is usually feasible and can be unbiased, if done correctly. It is essential to repeat every part of the process that is based on the data!!!
For Affymetrix data, one can normalize the left-out cases against a fixed baseline, estimate expression using "probe sensitivities", then apply the classifier.
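A minimal sketch of k-fold cross-validation in R that repeats the feature selection inside each fold, as the slide stresses. The simulated data (X, y), the t-test filter, the number of genes kept, and LDA via MASS::lda are stand-ins for whatever is actually used; the objects X and y are reused in later sketches below.

```r
library(MASS)

# Hypothetical data: 50 samples x 1000 genes, two classes
set.seed(4)
X <- matrix(rnorm(50 * 1000), nrow = 50)
y <- factor(rep(c("dCIS", "IBC"), each = 25))

K <- 5
folds <- sample(rep(1:K, length.out = nrow(X)))
err <- numeric(K)

for (k in 1:K) {
  tr <- folds != k
  # Feature selection is redone INSIDE each fold, using the leave-in cases only
  pvals <- apply(X[tr, ], 2, function(g) t.test(g ~ y[tr])$p.value)
  top   <- order(pvals)[1:20]                    # keep the 20 smallest p-values
  fit   <- lda(X[tr, top], grouping = y[tr])     # build the classifier on leave-in cases
  pred  <- predict(fit, X[!tr, top])$class       # apply it to the leave-out cases
  err[k] <- mean(pred != y[!tr])
}
mean(err)   # cross-validated error rate
```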
Example of Accuracy Estimates
• Select genes and create a classifier to distinguish between non-invasive dCIS and invasive breast cancer (IBC).
• Sample: HgU95av2 arrays, 12,558 genes; dCIS, n = 25; IBC, n = 25.
• Randomly select half in each group to be the training sample and half to be the test sample.
• Train:
  • Feature selection – use a t-test to select 169 genes, α = 0.001 (FDR ~ 6%).
  • Classifier construction – linear discriminant analysis.
• Test – compute scores and predict the class of the test cases.
LDA Scores of Training Samples (figure)
• Resubstitution prediction accuracy = 100%
• Independent prediction accuracy = 80% (sample size small, CI wide)
• 2-fold CV prediction accuracy = 88%
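A compact sketch of the same kind of comparison on the simulated X and y from the cross-validation sketch above: t-test feature selection and LDA on a random half of the samples, with accuracy computed both by resubstitution and on the held-out half. The gene counts and cutoff are illustrative, not those of the dCIS/IBC study.

```r
set.seed(5)
train <- sample(nrow(X), nrow(X) / 2)                      # random half as training sample

pvals <- apply(X[train, ], 2, function(g) t.test(g ~ y[train])$p.value)
top   <- which(pvals < 0.001)                              # t-test feature selection
if (length(top) < 2) top <- order(pvals)[1:2]              # guard for the simulated data

fit   <- lda(X[train, top], grouping = y[train])
resub <- mean(predict(fit, X[train, top])$class == y[train])    # resubstitution accuracy
indep <- mean(predict(fit, X[-train, top])$class == y[-train])  # independent test accuracy
c(resubstitution = resub, independent = indep)
```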
Discriminant Analysis
• Discriminator is a "linear" weighted combination of gene values.
• Various distance metrics.
• Many choices to make:
  • which features (genes) to use
  • distance metric
  • weights
• Choices must be made without reference to the "leave-out" cases.
• Two groups of 7 cases each
• Two genes (p << 0.001)
• How can we use these two genes to construct a classifier to predict group?
(Figure: scatterplot of the 14 cases on gene1 vs. gene2.)
CART
(Figure: first split of the gene1 vs. gene2 scatterplot at gene1 > 11, with Y/N branches.)
CART
(Figure: recursive splits of the scatterplot, first at gene1 > 11 and then at gene2 > 83, with Y/N branches.)
CART: Classification Tree (Binary Recursive Partitioning Tree)
• Binary – split a parent node into two child nodes.
• Recursive – each child node can be treated as a parent node.
• Partitioning – the data set is partitioned into mutually exclusive subsets in each split.
-- L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth International Group, 1984.
Classification Trees
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself).
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier.
• Implemented in R by the rpart and tree packages.
Three Aspects of Tree Construction
• Split selection rule
• Split-stopping rule
• Class assignment rule
• Different tree classifiers use different approaches to deal with these three issues, e.g. CART (Classification And Regression Trees).
Three Rules (CART)
• Splitting: At each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping: Grow a large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate.
• Class assignment: For each terminal node, choose the class with the majority vote.
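A short rpart sketch following these three rules, using the hypothetical X and y from the earlier cross-validation sketch; the Gini index is rpart's default splitting criterion for classification, and pruning uses the complexity parameter with the lowest cross-validated error from rpart's built-in CV. The restriction to 50 genes is purely for illustration.

```r
library(rpart)

dat <- data.frame(y = y, as.data.frame(X[, 1:50]))   # gene columns become V1..V50
fit <- rpart(y ~ ., data = dat, method = "class")    # splitting: Gini impurity by default

# Split-stopping: prune back to the subtree with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Class assignment: each terminal node predicts its majority class
predict(pruned, dat, type = "class")[1:5]
```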
Aggregating Classifiers
• Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set; the multiple versions of the predictor are aggregated by weighted voting.
• Let C(·, L_b) denote the classifier built from the b-th perturbed learning set L_b, and let w_b denote the weight given to predictions made by this classifier. The predicted class for an observation x is given by
argmax_k Σ_b w_b I(C(x, L_b) = k)
-- L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
-- L. Breiman. Out-of-bag estimation. Technical report, Statistics Department, U.C. Berkeley, 1996.
-- L. Breiman. Arcing classifiers. Annals of Statistics, 26:801-824, 1998.
Aggregating Classifiers
• The key to improved accuracy is the possible instability of the prediction method, i.e., whether small changes in the learning set result in large changes in the predictor.
• Unstable predictors tend to benefit the most from aggregation.
• Classification trees (e.g. CART) tend to be unstable.
• Nearest neighbor classifiers tend to be stable.
Bagging & Boosting
Two main methods for generating perturbed versions of the learning set:
• Bagging. -- L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
• Boosting. -- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.
Bagging = Bootstrap Aggregating. I. Nonparametric Bootstrap (BAG)
• Nonparametric bootstrap (standard bagging).
• Perturbed learning sets of the same size as the original learning set are formed by randomly selecting samples with replacement from the learning set.
• Predictors are built for each perturbed dataset and aggregated by plurality voting (w_b = 1), i.e., the "winning" class is the one being predicted by the largest number of predictors.
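A compact nonparametric-bootstrap bagging sketch in base R, reusing the hypothetical dat data frame and the rpart trees from the CART sketch above; B = 25 bootstrap replicates and plurality voting are illustrative choices (packages such as ipred or randomForest provide production implementations).

```r
B <- 25
votes <- matrix(NA_character_, nrow = nrow(dat), ncol = B)

set.seed(6)
for (b in 1:B) {
  idx  <- sample(nrow(dat), replace = TRUE)          # bootstrap learning set L_b
  tree <- rpart(y ~ ., data = dat[idx, ], method = "class")
  votes[, b] <- as.character(predict(tree, dat, type = "class"))
}

# Plurality voting (w_b = 1): the winning class is the one predicted most often
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == dat$y)    # (resubstitution) agreement of the bagged predictor
```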
Bagging = Bootstrap Aggregating. II. Parametric Bootstrap (MVN)
• Parametric bootstrap.
• Perturbed learning sets are generated according to a mixture of multivariate normal (MVN) distributions.
• The conditional density for each class is multivariate Gaussian (normal), i.e., P(X | Y = k) ~ N(μ_k, Σ_k); the sample mean vector and sample covariance matrix are used to estimate the population mean vector and covariance matrix.
• The class mixing probabilities are taken to be the class proportions in the actual learning set.
• At least one observation is sampled from each class.
• Predictors are built for each perturbed dataset and aggregated by plurality voting (w_b = 1).
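A minimal sketch of generating one parametric-bootstrap learning set from the hypothetical (X, y), assuming MASS::mvrnorm and only a few genes so that the sample covariance matrix is well conditioned; for simplicity the class sizes are allocated deterministically in proportion to the observed class frequencies, with at least one draw per class.

```r
library(MASS)

Xs <- X[, 1:10]                                    # few genes so cov() is invertible
set.seed(7)

sizes <- pmax(round(as.vector(table(y)) / length(y) * nrow(Xs)), 1)
names(sizes) <- levels(y)                          # class mixing ~ observed proportions

Xb <- do.call(rbind, lapply(levels(y), function(k) {
  Xk <- Xs[y == k, , drop = FALSE]
  mvrnorm(sizes[[k]], mu = colMeans(Xk), Sigma = cov(Xk))   # draw from N(mu_k, Sigma_k)
}))
yb <- factor(rep(levels(y), times = sizes))        # one perturbed learning set (Xb, yb)
```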
Bagging = Bootstrap Aggregating. III. Convex Pseudo-Data (CPD)
• Convex pseudo-data. One perturbed learning set is generated by repeating the following n times:
  • Select two samples (x, y) and (x', y') at random from the learning set L.
  • Select at random a number v from the interval [0, d], 0 ≤ d ≤ 1, and let u = 1 − v.
  • The new sample is (x'', y''), where y'' = y and x'' = u·x + v·x'.
• Note that when d = 0, CPD reduces to standard bagging.
• Predictors are built for each perturbed dataset and aggregated by plurality voting (w_b = 1).
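A short sketch of generating one convex pseudo-data learning set from the hypothetical (X, y); d = 0.3 is an illustrative choice of the mixing parameter.

```r
set.seed(8)
d  <- 0.3
n  <- nrow(X)
Xc <- matrix(NA_real_, n, ncol(X))
yc <- y

for (i in 1:n) {
  pair <- sample(n, 2)                             # two samples (x, y) and (x', y')
  v <- runif(1, 0, d); u <- 1 - v
  Xc[i, ] <- u * X[pair[1], ] + v * X[pair[2], ]   # x'' = u*x + v*x'
  yc[i]   <- y[pair[1]]                            # y'' = y (label of the first sample)
}
# With d = 0, v is always 0, so each new case is a resampled original case (standard bagging)
```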
Boosting
• The perturbed learning sets are re-sampled adaptively so that the weights in the re-sampling are increased for those cases most often misclassified.
• The aggregation of predictors is done by weighted voting (w_b ≠ 1).
Boosting
• Learning set: L = (X_1, Y_1), ..., (X_n, Y_n).
• Re-sampling probabilities p = {p_1, …, p_n}, initialized to be equal.
• The b-th step of the boosting algorithm is:
  • Using the current re-sampling probabilities p, sample with replacement from L to get a perturbed learning set L_b.
  • Build a classifier C(·, L_b) based on L_b.
  • Run the learning set L through the classifier C(·, L_b) and let d_i = 1 if the i-th case is classified incorrectly, and d_i = 0 otherwise.
  • With ε_b = Σ_i p_i d_i and β_b = (1 − ε_b) / ε_b, update the re-sampling probabilities for the (b+1)-st step by p_i ← p_i β_b^{d_i} / Σ_j p_j β_b^{d_j}.
  • The weight for each classifier is w_b = log(β_b).
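A minimal sketch of this re-sampling boosting loop with rpart stumps on the hypothetical dat from the CART sketch; the update formulas follow the standard AdaBoost-style weights written above, and the early stop when ε_b is 0 or ≥ 0.5 is a common practical guard rather than part of the slide.

```r
B <- 20
n <- nrow(dat)
p <- rep(1 / n, n)                       # equal initial re-sampling probabilities
classifiers <- list(); w <- numeric(0)

set.seed(9)
for (b in 1:B) {
  idx  <- sample(n, n, replace = TRUE, prob = p)        # sample L_b using current p
  tree <- rpart(y ~ ., data = dat[idx, ], method = "class",
                control = rpart.control(maxdepth = 1))  # a stump as the base classifier
  d <- as.integer(predict(tree, dat, type = "class") != dat$y)   # misclassification indicator
  eps <- sum(p * d)
  if (eps == 0 || eps >= 0.5) break                     # practical guard
  beta <- (1 - eps) / eps
  p <- p * beta^d; p <- p / sum(p)                      # update re-sampling probabilities
  classifiers[[b]] <- tree; w[b] <- log(beta)           # classifier weight w_b
}
```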
Nearest Neighbors
• Majority or plurality voting.
• Various distance metrics.
• Often effective for small samples.
• K is usually 1 or 3, up to 7.
• Many choices to make:
  • features to use
  • distance metric
  • K
• Choices must be made without reference to the "leave-out" cases.
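A brief sketch with class::knn on the hypothetical data, reusing the training split and the selected genes (train, top) from the earlier LDA sketch; k = 3 and the default Euclidean distance are illustrative choices.

```r
library(class)

pred_knn <- knn(train = X[train, top],        # training cases (selected genes only)
                test  = X[-train, top],       # left-out cases
                cl    = y[train],             # known class labels of the training cases
                k     = 3)                    # 3 nearest neighbours, majority vote
mean(pred_knn == y[-train])                   # test-set accuracy
```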
Comparison of Methods
• With modest sample sizes (<100), methods that DO NOT account for correlations among genes, interactions, or other structural complexity perform best.
• Studies to date do not favor a single best classifier.
• Estimates of classifier accuracy should account for ALL "training decisions".
Dudoit, S., J. Fridlyand, et al. (2002). "Comparison of discrimination methods for the classification of tumors using gene expression data." J Am Statistical Assoc 97(457): 77-87.
How bad could it be? S. Michiels, S. Koscielny, C. Hill, Lancet 365, 488-92 (Feb 5-11, 2005).
Comparison of Classifiers
• Dudoit, Fridlyand, Speed (JASA, 2002)
  • FLDA (Fisher linear discriminant analysis)
  • DLDA (diagonal linear discriminant analysis)
  • DQDA (diagonal quadratic discriminant analysis)
  • NN (nearest neighbour)
  • CART (classification and regression tree)
• Bagging and boosting
  • Bagging (nonparametric bootstrap)
  • CPD (convex pseudo-data)
  • MVN (parametric bootstrap)
  • Boosting
-- Dudoit, Fridlyand, Speed: "Comparison of discrimination methods for the classification of tumors using gene expression data", JASA, 2002.
Comparison Study Datasets
• Leukemia – Golub et al. (1999): n = 72 samples, G = 3,571 genes, 3 classes (B-cell ALL, T-cell ALL, AML).
• Lymphoma – Alizadeh et al. (2000): n = 81 samples, G = 4,682 genes, 3 classes (B-CLL, FL, DLBCL).
• NCI 60 – Ross et al. (2000): n = 64 samples, G = 5,244 genes, 8 classes.
Procedure
For each run (150 runs in total):
• 2/3 of the samples are randomly selected as the learning set (LS), and the remaining 1/3 as the test set (TS).
• The top p genes with the largest BSS/WSS ratios are selected using the learning set:
  • p = 50 for the lymphoma dataset,
  • p = 40 for the leukemia dataset,
  • p = 30 for the NCI 60 dataset.
• Predictors are constructed and error rates are obtained by applying the predictors to the test set.
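A small sketch of the BSS/WSS gene-selection score on the hypothetical learning set (X[train, ], y[train]) from the earlier sketches; BSS and WSS are the between-class and within-class sums of squares of each gene across samples, and p = 40 is used here purely as an example.

```r
bss_wss <- function(x, cl) {               # x: one gene's values, cl: class labels
  overall <- mean(x)
  means   <- tapply(x, cl, mean)
  sizes   <- table(cl)
  bss <- sum(sizes * (means - overall)^2)                        # between-class SS
  wss <- sum(tapply(x, cl, function(v) sum((v - mean(v))^2)))    # within-class SS
  bss / wss
}

scores <- apply(X[train, ], 2, bss_wss, cl = y[train])
top_p  <- order(scores, decreasing = TRUE)[1:40]    # e.g., p = 40 as for the leukemia data
```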
Leukemia data, 2 classes: test set error rates; 150 LS/TS runs (figure).
Leukemia data, 3 classes: test set error rates; 150 LS/TS runs (figure).