
Data Mining CSCI 307, Spring 2019 Lecture 27

Learn about the holdout and cross-validation methods for accurate model construction and error estimation when dealing with limited data. Understand the benefits of stratification and repeated iterations to improve reliability.



Presentation Transcript


  1. Data Mining, CSCI 307, Spring 2019, Lecture 27: Confidence Intervals; Hold-Out and Leave-One-Out Cross-Validation

  2. 5.3 What if the Amount of Data is Limited? The holdout method reserves a certain amount of the data for testing and uses the remainder for training • Often: one third for testing, the rest for training • Problem: the samples might not be representative • e.g., a class might be entirely missing from the test data • An advanced version uses stratification • Ensures that each class is represented in approximately equal proportions in both subsets
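A minimal holdout split can be sketched in plain Python (the function name `holdout_split` and the 30-instance toy data are invented for this illustration, not taken from the lecture):

```python
import random

def holdout_split(data, test_fraction=1/3, seed=0):
    """Shuffle a copy of the data and reserve a fraction of it for testing."""
    rng = random.Random(seed)
    shuffled = data[:]                      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

# 30 labeled instances as (feature, class) pairs
data = [(i, i % 2) for i in range(30)]
train, test = holdout_split(data)
print(len(train), len(test))   # 20 10
```

Note that this plain random split is exactly where the representativeness problem arises: nothing stops one class from being under-represented, or even absent, in the test third.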

  3. Stratified Sampling • Partition the data set by class, and draw samples from each partition proportionally, i.e., approximately the same percentage of the data from each class (Diagram: raw data partitioned into a stratified sample)
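A sketch of that stratified split, assuming the partitions are the classes (the function name `stratified_split` and the 2:1 toy data are invented here):

```python
import random
from collections import defaultdict

def stratified_split(data, test_fraction=1/3, seed=0):
    """Split each class's partition separately so that both subsets
    keep approximately the class proportions of the raw data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in data:                       # partition by class label
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():         # sample each partition proportionally
        rng.shuffle(items)
        n_test = int(round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# two classes in a 2:1 ratio; the split preserves that ratio in both subsets
data = [(i, 'a') for i in range(18)] + [(i, 'b') for i in range(9)]
train, test = stratified_split(data)
```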

  4. Repeated Holdout Method • The holdout estimate can be made more reliable by repeating the process with different subsamples • In each iteration, a certain proportion is randomly selected for training (possibly with stratification) • The error rates from the different iterations are averaged to yield an overall error rate • This is called the repeated holdout method • Still not optimal: the test sets of the different iterations overlap • Can we prevent overlapping?
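The repeated holdout procedure can be sketched as follows; to keep the sketch self-contained it "trains" a trivial majority-class predictor, and the function name is invented for this example:

```python
import random
from collections import Counter

def repeated_holdout_error(data, iterations=10, test_fraction=1/3, seed=0):
    """Average the test error of a majority-class predictor
    over several independent random holdout splits."""
    rng = random.Random(seed)
    error_rates = []
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(round(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        # "train": always predict the majority class of the training set
        majority = Counter(y for _, y in train).most_common(1)[0][0]
        errors = sum(1 for _, y in test if y != majority)
        error_rates.append(errors / len(test))
    return sum(error_rates) / len(error_rates)   # overall error estimate

data = [(i, 'yes') for i in range(40)] + [(i, 'no') for i in range(20)]
estimate = repeated_holdout_error(data)
```

Because each iteration reshuffles the whole data set, instances can land in the test set of several iterations, which is the overlap the slide objects to.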

  5. Cross-Validation The practical method of choice in limited-data situations • Cross-validation avoids overlapping test sets • First step: split the data into k subsets of equal size • Second step: use each subset in turn for testing and the remainder for training • Called k-fold cross-validation • The error estimates are averaged to yield an overall error estimate
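The two steps above can be sketched as an index generator (a minimal illustration; the function name is invented, and a real run would shuffle the instances first):

```python
def kfold_indices(n, k):
    """Yield (test_indices, train_indices) for each of k folds over
    n instances; every instance appears in exactly one test set."""
    # spread the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # step two: each fold is the test set once, the rest train
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield test_idx, train_idx

splits = list(kfold_indices(10, 3))   # 3 folds of sizes 4, 3, 3
```

Since the folds are disjoint, the test sets cannot overlap, which is exactly what repeated holdout could not guarantee.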

  6. Cross-Validation continued • Standard method for evaluation: Stratified ten-fold cross-validation • Why ten? • Extensive experiments have shown that this is the best choice to get an accurate estimate • Also some theoretical evidence • Stratification reduces the estimate’s variance • Even better: repeated stratified cross-validation • e.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
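Repeated stratified fold assignment can be sketched like this; dealing each class's instances out round-robin is one simple way to stratify, and the function name is invented for this example:

```python
import random
from collections import defaultdict

def repeated_stratified_folds(labels, k=10, repeats=10, seed=0):
    """For each repeat, assign every instance index to one of k folds,
    dealing each class out round-robin after a fresh shuffle so that
    every fold gets roughly the same class distribution."""
    rng = random.Random(seed)
    runs = []
    for _ in range(repeats):
        by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            by_class[y].append(idx)
        fold_of = {}
        for idxs in by_class.values():
            rng.shuffle(idxs)               # new random folds on every repeat
            for i, idx in enumerate(idxs):
                fold_of[idx] = i % k        # round-robin keeps folds stratified
        runs.append(fold_of)
    return runs

runs = repeated_stratified_folds(['a'] * 20 + ['b'] * 10, k=10, repeats=5)
```

Averaging the error estimates across all the repeats is what reduces the variance of the final estimate.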

  7. Leave-One-Out Cross-Validation Leave-one-out is a form of cross-validation • Set the number of folds to the number of training instances • i.e., for n training instances, build the classifier n times • Makes the best use of the data • Involves no random subsampling • Computationally expensive • Cannot be stratified
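Setting the number of folds to the number of instances makes the fold generator trivial (a minimal sketch; the function name is invented here):

```python
def leave_one_out(data):
    """Yield (held_out_instance, training_set) pairs: n folds for
    n instances, deterministically, with no random subsampling."""
    for i in range(len(data)):
        yield data[i], data[:i] + data[i + 1:]

# four instances => the classifier would be built four times
pairs = list(leave_one_out(['a', 'b', 'c', 'd']))
```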

  8. L-O-O CV and Stratification • Disadvantage: stratification is not possible • It guarantees a non-stratified sample because there is only one instance in the test set • Extreme (and artificial) example: a random dataset split equally between two classes • The best inducer predicts the majority class, giving a true error rate of 50% • BUT with leave-one-out, removing the test instance always leaves its class in the minority, so the majority-class predictor picks the opposite class every time and the estimated error is 100%
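This pathological example can be checked directly: on a perfectly balanced two-class dataset, a majority-class predictor errs on every single leave-one-out fold (a small self-contained demonstration, not code from the lecture):

```python
from collections import Counter

# 50 instances of each class: with no real signal, the best an inducer
# can do is predict the majority class, for a true error rate of 50%
data = ['yes'] * 50 + ['no'] * 50

errors = 0
for i in range(len(data)):
    train = data[:i] + data[i + 1:]
    # removing the test instance leaves its own class in the minority,
    # so the majority of the training set is always the opposite class
    majority = Counter(train).most_common(1)[0][0]
    if majority != data[i]:
        errors += 1

print(errors / len(data))   # 1.0 -- leave-one-out estimates 100% error
```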

  9. Summary: Holdout & Cross-Validation Methods Holdout method • Data is randomly partitioned into two independent sets • Training set (e.g., 2/3) for model construction • Test set (e.g., 1/3) for accuracy estimation • Random sampling: a variation of holdout • Repeat holdout k times; accuracy = average of the accuracies obtained Cross-validation (k-fold, where k = 10 is most popular) • Randomly partition the data into k mutually exclusive subsets, each of approximately equal size • At the i-th iteration, use Di as the test set and the others as the training set • Leave-one-out: k folds where k = number of instances, for small data sets • Stratified cross-validation: folds are stratified so that the class distribution in each fold approximates that of the initial data
