Learn about the holdout and cross-validation methods for accurate model construction and error estimation when dealing with limited data. Understand the benefits of stratification and repeated iterations to improve reliability.
Data Mining
CSCI 307, Spring 2019
Lecture 27: Confidence Intervals, Hold-Out, Leave-One-Out Cross-Validation
5.3 What if Amount of Data is Limited?
The holdout method reserves a certain amount of the data for testing and uses the remainder for training
• Often: one third for testing, the rest for training
Problem: the samples might not be representative
• e.g., a class might be missing in the test data
Advanced version uses stratification
• Ensures that each class is represented in approximately the same proportion in both subsets
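A minimal sketch of the plain (non-stratified) holdout split described above, using only the Python standard library. The function names `holdout_split` and `missing_classes` and the toy labels are illustrative assumptions, not part of the lecture.

```python
import random

def holdout_split(n_instances, test_fraction=1/3, seed=0):
    """Return (train_indices, test_indices) for one random holdout split."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    n_test = round(n_instances * test_fraction)
    return indices[n_test:], indices[:n_test]

def missing_classes(labels, test_indices):
    """Classes that occur in the data but never in the test set."""
    return set(labels) - {labels[i] for i in test_indices}

# Hypothetical toy data: 28 instances of class "a" and only 2 of class "b".
labels = ["a"] * 28 + ["b"] * 2
train_idx, test_idx = holdout_split(len(labels))
print(len(train_idx), len(test_idx))      # 20 10
print(missing_classes(labels, test_idx))  # may contain "b", depending on the shuffle
```

With a rare class like "b", some seeds will leave it out of the test set entirely, which is the representativeness problem that stratification is meant to address.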
Stratified Sampling
• Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
[Figure: Raw Data and Stratified Sample]
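One way to sketch proportional sampling from each partition, assuming the partitions are the class labels. `stratified_sample` is a hypothetical helper name, and rounding the per-class sample size is a simplifying assumption (a very small class can round down to zero draws).

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Draw about `fraction` of the indices from each class (stratum)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    sample = []
    for members in by_class.values():
        n_draw = round(len(members) * fraction)  # proportional share of this class
        sample.extend(rng.sample(members, n_draw))
    return sorted(sample)

# Hypothetical data with three classes of different sizes.
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 9
test_idx = stratified_sample(labels, fraction=1/3)
# Each class contributes a third of its instances: 20, 10 and 3.
print(len(test_idx))  # 33
```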
Repeated Holdout Method
• Holdout estimate can be made more reliable by repeating the process with different subsamples
  • In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  • The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimum: the different test sets overlap
  • Can we prevent overlapping?
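The repeated holdout procedure can be sketched as follows. The majority-class classifier is only a stand-in learner chosen to keep the example self-contained; the function names and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter

def majority_error(labels, train_idx, test_idx):
    """Error rate of a classifier that always predicts the training majority class."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    wrong = sum(1 for i in test_idx if labels[i] != majority)
    return wrong / len(test_idx)

def repeated_holdout(labels, repetitions=10, test_fraction=1/3, seed=0):
    """Average the holdout error rate over several random splits."""
    rng = random.Random(seed)
    n_test = round(len(labels) * test_fraction)
    errors = []
    for _ in range(repetitions):
        indices = list(range(len(labels)))
        rng.shuffle(indices)  # a fresh random subsample in every iteration
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        errors.append(majority_error(labels, train_idx, test_idx))
    return sum(errors) / len(errors)

# Hypothetical labels: 20 "spam" and 40 "ham".
labels = ["spam"] * 20 + ["ham"] * 40
print(repeated_holdout(labels))
```

Because every iteration reshuffles all instances, the same instance can land in several test sets, which is the overlap the slide points out.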
Cross-Validation
Practical method of choice in limited data situations
• Cross-validation avoids overlapping test sets
  • First step: split data into k subsets of equal size
  • Second step: use each subset in turn for testing, the remainder for training
• Called k-fold cross-validation
• The error estimates are averaged to yield an overall error estimate
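A sketch of the two steps of k-fold cross-validation. `k_fold_splits` and `cross_validation_error` are hypothetical names, and `fold_error` stands for whatever routine trains a model on the training indices and returns its error rate on the test indices.

```python
import random

def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    # First step: k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    # Second step: each fold serves once as the test set.
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validation_error(n_instances, fold_error, k=10, seed=0):
    """Average the per-fold error rates returned by fold_error(train_idx, test_idx)."""
    errors = [fold_error(tr, te) for tr, te in k_fold_splits(n_instances, k, seed)]
    return sum(errors) / len(errors)

# Every instance is tested exactly once, so the test sets never overlap.
tested = sorted(i for _, te in k_fold_splits(25, k=10) for i in te)
print(tested == list(range(25)))  # True
```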
Cross-Validation continued
• Standard method for evaluation: Stratified ten-fold cross-validation
• Why ten?
  • Extensive experiments have shown that ten is about the right number of folds to get an accurate estimate
  • Also some theoretical evidence
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
  • e.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
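Repeated stratified cross-validation might look like the sketch below. Dealing the instances of each class round-robin across the folds is one simple way to stratify, assumed here for illustration; the function names are hypothetical, and the data set is assumed to have at least k instances so that no fold is empty.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds so each fold roughly mirrors the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for members in by_class.values():
        rng.shuffle(members)
        for index in members:  # deal instances round-robin, class by class
            folds[position % k].append(index)
            position += 1
    return folds

def repeated_stratified_cv(labels, fold_error, k=10, repetitions=10, seed=0):
    """Average fold_error(train_idx, test_idx) over several stratified k-fold runs."""
    errors = []
    for r in range(repetitions):
        folds = stratified_folds(labels, k, seed + r)  # new shuffle per repetition
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            errors.append(fold_error(train_idx, test_idx))
    return sum(errors) / len(errors)

# Hypothetical data: with 50/30/20 instances, each of 10 folds gets 5/3/2.
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
fold = stratified_folds(labels)[0]
print([sum(labels[i] == c for i in fold) for c in "abc"])  # [5, 3, 2]
```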
Leave-One-Out Cross-Validation
Leave-One-Out is a form of cross-validation
• Set number of folds to number of training instances
  • i.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Computationally expensive
• Cannot be stratified
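Leave-one-out needs no shuffling at all, as this sketch shows. `leave_one_out_splits` and `leave_one_out_error` are hypothetical names, and `is_misclassified` stands for any routine that trains on the training indices and reports whether the single held-out instance was predicted incorrectly.

```python
def leave_one_out_splits(n_instances):
    """Yield (train_indices, test_indices) with exactly one test instance per split."""
    for held_out in range(n_instances):
        train_idx = [i for i in range(n_instances) if i != held_out]
        yield train_idx, [held_out]

def leave_one_out_error(n_instances, is_misclassified):
    """Fraction of held-out instances flagged by is_misclassified(train_idx, test_index)."""
    mistakes = sum(
        bool(is_misclassified(train_idx, test_idx[0]))
        for train_idx, test_idx in leave_one_out_splits(n_instances)
    )
    return mistakes / n_instances

# n instances give n splits, so the classifier is built n times.
print(len(list(leave_one_out_splits(5))))  # 5
```

The n model builds are what make the method expensive on anything but small data sets.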
L-O-O CV and Stratification
• Disadvantage: stratification is not possible
  • It guarantees a non-stratified sample because there is only one instance in the test set
• Extreme (and artificial) example: a random dataset that is split equally into two classes
  • The best inducer predicts the majority class, which gives a true error rate of 50%
  • BUT with Leave-One-Out, removing the test instance makes the opposite class the majority in the training data, so the prediction is wrong every time and the estimated error rate is 100%
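The extreme example can be checked directly with a small sketch. The function name `loo_majority_error` and the balanced toy labels are assumptions made for illustration.

```python
from collections import Counter

def loo_majority_error(labels):
    """Leave-one-out error of a classifier that predicts the training majority class."""
    mistakes = 0
    for held_out, true_label in enumerate(labels):
        training = labels[:held_out] + labels[held_out + 1:]
        majority = Counter(training).most_common(1)[0][0]
        mistakes += majority != true_label
    return mistakes / len(labels)

# Hypothetical, perfectly balanced data: 50 instances per class.
labels = ["pos"] * 50 + ["neg"] * 50
print(loo_majority_error(labels))  # 1.0
```

Holding out one instance leaves 49 of its class against 50 of the other, so the majority vote always picks the wrong class, even though the true error rate of such a predictor on this data is 50%.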
Summary: Holdout & Cross-Validation Methods
Holdout method
• Data is randomly partitioned into two independent sets
  • Training set (e.g., 2/3) for model construction
  • Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
  • Repeat holdout k times, accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
• At the i-th iteration, use D_i as the test set and the others as the training set
• Leave-one-out: k folds where k = # of instances, for small-sized data
• *Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data