Machine Learning CSE 681 CH2 - Supervised Learning: Model Assessment
Model Assessment • We can measure the generalization ability (performance) of a hypothesis, namely, the quality of its inductive bias, if we have access to data outside the training set. • We simulate this by dividing the training set we have into two parts. We use one part for training (i.e., to fit a hypothesis), and the remaining part, called the validation set (V), is used to test the generalization ability. • Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V), the error on the validation set, to select the model hi with the smallest generalization error
Model Assessment • If we need to report the error to give an idea about the expected error (performance) of our best model, we should not use the validation error. • We have used the validation set to choose the best model, so it has effectively become part of the training set. We need a third set, a test set, sometimes also called the publication set, containing examples not used in training or validation.
Model Assessment • To estimate generalization error (performance), we need data unseen during training. • Standard setup: we split the data as follows (a split sketch follows this slide) • Training set (50-80%) • Validation set (10-25%) • Test (publication) set (10-25%) • Note: • Validation data can be added to the training set before testing • Resampling methods can be used if data is limited
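A minimal sketch of the standard three-way split described above, assuming Python with NumPy and scikit-learn; the placeholder arrays X and y, the 60/20/20 ratios, and the fixed random seed are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 examples with 4 features and binary labels.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

# First set aside the test (publication) set, then split the rest into
# training and validation sets: roughly 60% / 20% / 20% overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% equals 20% of the original data for validation.
```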
Model Assessment • In practice, if we are using a machine learning tool, almost invariably all classifiers have one or more parameters • The number of neighbors in kNN classification • The network size, learning parameters and weights in neural networks (MLP) • How do we select the “optimal” parameter(s) or model for a given classification problem? • Methods: • Three-way data splits • Two-way data splits
Model Assessment with three-way data splits • If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets • Training set: a set of examples used for learning, i.e., to fit the parameters of the classifier • In the MLP case, we would use the training set to find the “optimal” weights with the back-prop rule • Validation set: a set of examples used to tune the parameters of a classifier • In the MLP case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm • Test set: a set of examples used only to assess the performance of a fully trained classifier • In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights) • After assessing the final model with the test set, YOU MUST NOT further tune the model (a model-selection sketch follows below) Source: Ricardo Gutierrez-Osuna
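A sketch of model selection with the three-way split, assuming the X_train/X_val/X_test and y_train/y_val/y_test arrays from the previous sketch; a kNN classifier stands in for the model, and the candidate values of k are arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tune the hyperparameter (here, k for kNN) on the validation set only.
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Optionally refit on training + validation data, then touch the test set exactly once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_error = 1.0 - final_model.score(X_test, y_test)  # report this; do not tune further
```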
Model Assessment with three-way data splits • [Figure omitted: diagram of the three-way data split procedure] Source: Ricardo Gutierrez-Osuna
Model Assessment with two-way data splits • We may prefer the two-way data split approach if we are using a machine learning tool with standard parameters. • Two-way data split methods: • Holdout method • Cross Validation • Random Subsampling • K-Fold Cross-Validation • Leave-one-out Cross-Validation • Bootstrap • In two-way data splits, we split the data into two disjoint subsets, which we call the training set and the test set.
The holdout method • Split the dataset into two groups (70%, 30%) • Training set: used to train the classifier • Test set: used to estimate the error rate of the trained classifier • A typical application of the holdout method is determining a stopping point for back propagation: training stops when the error on the held-out set begins to rise (a holdout sketch follows below)
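A holdout sketch under the same assumptions (NumPy, scikit-learn, illustrative data and kNN classifier); the 70%/30% ratio matches the split quoted above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = rng.randint(0, 2, size=200)

# Single 70% / 30% train-and-test experiment.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
holdout_error = 1.0 - clf.score(X_test, y_test)  # one estimate; depends on this single split
```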
The holdout method • The holdout method has two basic drawbacks • In problems where we have a sparse dataset, we may not be able to afford the “luxury” of setting aside a portion of the dataset for testing • Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an “unfortunate” split • The limitations of the holdout can be overcome with a family of resampling methods, at the expense of more computation • Cross Validation • Random Subsampling • K-Fold Cross-Validation • Leave-one-out Cross-Validation • Bootstrap Source: Ricardo Gutierrez-Osuna
Random subsampling • Random subsampling performs K data splits of the entire dataset • Each data split randomly selects a (fixed) number of examples without replacement • For each data split we retrain the classifier from scratch with the training examples and then estimate Ei with the test examples • The true error estimate is obtained as the average of the separate estimates Ei • This estimate is significantly better than the holdout estimate Source: Ricardo Gutierrez-Osuna
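A random-subsampling sketch under the same illustrative assumptions; the number of repetitions K and the 30% test fraction are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = rng.randint(0, 2, size=200)

K = 10
errors = []
for i in range(K):
    # A different random split (without replacement) on every repetition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=i)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)  # retrain from scratch
    errors.append(1.0 - clf.score(X_te, y_te))                 # Ei on the test part

true_error_estimate = np.mean(errors)  # average of the K separate estimates
```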
K-Fold Cross-Validation • Divide the dataset into K partitions (folds) • For each of K experiments, use K-1 folds for training and the remaining one for testing (K=4 in the sketch below) • The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing • As before, the true error is estimated as the average error rate
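A K-fold cross-validation sketch (K=4 as on the slide), assuming scikit-learn's KFold splitter and the same illustrative data and classifier as above.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = rng.randint(0, 2, size=200)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    fold_errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

cv_error = np.mean(fold_errors)  # every example is tested exactly once
```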
Leave-one-out Cross-Validation • Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples • For a dataset with N examples, perform N experiments • For each experiment use N-1 examples for training and the remaining example for testing • As usual, the true error is estimated as the average error rate on the test examples
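A leave-one-out sketch: the same loop with K = N, assuming scikit-learn's LeaveOneOut splitter; the small dataset is illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = rng.randint(0, 2, size=50)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):  # N experiments
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))  # 0 or 1 per held-out example

loo_error = np.mean(errors)
```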
How many folds are needed? • With a large number of folds • + The bias of the true error rate estimator will be small (the estimator will be very accurate) • – The variance of the true error rate estimator will be large • – The computation time will be very large as well (many experiments) • With a small number of folds • + The number of experiments and, therefore, the computation time are reduced • + The variance of the estimator will be small • – The bias of the estimator will be large (conservative, i.e., larger than the true error rate) • In practice, the choice of K depends on the size of the dataset • For large datasets, even 3-fold cross-validation will be quite accurate • For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible • A common choice is K=10
The bootstrap • The bootstrap is a resampling technique with replacement • From a dataset with N examples • Randomly select (with replacement) N examples and use this set for training • The remaining examples that were not selected for training are used for testing • The number of such test examples is likely to change from fold to fold • Repeat this process for a specified number of folds (K) • As before, the true error is estimated as the average error rate on test data (a bootstrap sketch follows below)
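A bootstrap sketch under the same illustrative assumptions, using plain NumPy for the resampling: N indices are drawn with replacement for training, and the never-drawn (out-of-bag) examples serve as the test set; the number of folds K is arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = rng.randint(0, 2, size=200)
N, K = len(X), 30  # K bootstrap folds (illustrative)

errors = []
for _ in range(K):
    boot_idx = rng.randint(0, N, size=N)    # N draws with replacement
    oob_mask = np.ones(N, dtype=bool)
    oob_mask[boot_idx] = False              # examples never drawn = test set
    if not oob_mask.any():                  # vanishingly unlikely, but stay safe
        continue
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[boot_idx], y[boot_idx])
    errors.append(1.0 - clf.score(X[oob_mask], y[oob_mask]))

bootstrap_error = np.mean(errors)  # average error on the out-of-bag examples
```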
The bootstrap • By generating new training sets of size N from the dataset by random sampling with replacement, the probability that we do not pick a given instance after N draws is (1 - 1/N)^N ≈ e^(-1) ≈ 0.368 • that is, about 36.8% of the instances are never drawn: they are “new” (unseen) to the trained model and can be used for testing • Compared to basic CV, the bootstrap increases the variance that can occur in each fold [Efron and Tibshirani, 1993] • This is a desirable property, since it is a more realistic simulation of the real-life experiment from which our dataset was obtained • Consider a classification problem with C classes, a total of N examples and Ni examples for each class ωi • The a priori probability of choosing an example from class ωi is Ni/N • Once we choose an example from class ωi, if we do not replace it for the next selection, the a priori probabilities change, since the probability of choosing an example from class ωi becomes (Ni - 1)/(N - 1) • Thus, sampling with replacement preserves the a priori probabilities of the classes throughout the random selection process • An additional benefit is that the bootstrap can provide accurate measures of BOTH the bias and variance of the true error estimate
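A quick numerical check of the figure quoted above: (1 - 1/N)^N approaches e^(-1) ≈ 0.368 as N grows.

```python
import math

for N in (10, 100, 1000, 100000):
    print(N, (1 - 1 / N) ** N)   # probability a given instance is never drawn in N draws
print("e^-1 =", math.exp(-1))    # ≈ 0.3679
```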