200 likes | 242 Views
Classification / Prediction. “Supervised Learning” Use a “training set” of examples to create a model that is able to predict, given an unknown sample, which of two or more classes that sample belongs to. Classification. ?. How to build a classifier. How to evaluate a classifier.
E N D
“Supervised Learning” Use a “training set” of examples to create a model that is able to predict, given an unknown sample, which of two or more classes that sample belongs to. Classification ?
How to build a classifier. How to evaluate a classifier. Using GenePattern to classify expression data. What we’ll cover
A predictive rule that uses a set of inputs(genes) to predict the values of the output(phenotype). Known examples (train data) are used to build the predictive rule. Goal: Achieve high generalization (predictive) power. Avoid over-fitting. What Is a Classifier
Classification Computational methodology Known Classes Data Assess Gene-Class Correlation Feature (Marker) Selection Build Classifier Test Classifier by Cross-Validation Evaluate Classifier on Independent Test Set
Regression Trees (CART) Weighted Voting k-Nearest Neighbors Support Vector Machines Classification Computational methodology Known Classes Expression Data Assess Gene-Class Correlation Feature Selection Build Classifier Test Classifier by Cross-Validation Evaluate Classifier on Independent Test Set
Important issues: Few cases, many variables (genes) redundancy: many highly correlated genes. noise: measurements are very imprecise. feature selection: reducing p is a necessity. Avoid over-fitting. Classifiers
project samples in gene space classblack gene 2 classorange gene 1 K-nn Classifier Example: K=5, 2 genes, 2 classes
? K-nn Classifier Example: K=5, 2 genes, 2 classes project unknown sample classblack gene 2 classorange gene 1
Distance measures: • Euclidean distance • 1-Pearson correlation • … ? K-nn Classifier Example: K=5, 2 genes, 2 classes "consult" 5 closest neighbors: -3 black - 2 orange classblack gene 2 classorange gene 1
Support Vector Machine (SVM) Noble, Nat Biotech 2006
Mixture of Experts approach: Each gene casts a vote for one of the possible classes. The vote is weighted by a score assessing the reliability of the expert (in this case, the gene). The class receiving the highest vote will be the predicted class. The vote can be used as a proxy for the probability of class membership (prediction strength). new sample Class 1centroid Class 2centroid g1 g2 … gi … gn-1 gn g1 g2 … gi … gn-1 gn ? Weighted Voting Slonim et al., RECOMB 2000
Regression Trees (CART) Weighted Voting k-Nearest Neighbors Support Vector Machines Class Prediction Computational methodology Expression Data Known Classes Assess Gene-Class Correlation Feature Selection Build Predictor Test Predictor by Cross-Validation Evaluate Predictor on Independent Test Set
Evaluation on independent test set Build the classifier on the train set. Assess prediction performance on test set. Maximize generalization/Avoid overfitting. Performance measure error rate = # of cases correctly classified total # of cases Testing the Classifier
Evaluation on independent test set What if we don’t have an independent test set? Cross Validation (XV): Split the dataset into n folds (e.g., 10 folds of 10 cases each). For each fold (e.g., for each group of 10 cases), train (i.e., build model) on n-1 folds (e.g., on 90 cases), test (i.e., predict) on left-out fold (e.g., on remaining 10 cases). Combine test results. Frequently, leave-one-out XV (when small sample size). Testing the Classifier
Testing the Classifier Learning curves – leave one out cross validation ALL vs. MLL vs. AML
Error rate estimate: Evaluate on independent test set: Best error estimate. Cross Validation Needed when small sample size and for model selection. Testing the Classifier
Start by splitting data into train and test set (stratified). “Forget” about the test set until the very end. Explore different feature selection methods and different classifiers on train set by XV. Once the “best” classifier and best classifier parameters have been selected (based on XV) Build a classifier with given parameters on entire train set. Apply classifier to test set. Classification Cookbook
Split data into train and test set – SplitDatasetTrainTest Explore different feature selection methods and different classifiers on train set by CV. CARTXValidation KNNXValidation WeightedVotingXValidation Once the “best” classifier and best classifier parameters have been selected (based on CV) CART KNN WeightedVoting Examine results: PredictionReultsViewer FeatureSummaryViewer Classification GenePattern methods