
Data Mining CSCI 307, Spring 2019 Lecture 27

Learn about the holdout and cross-validation methods for accurate model construction and error estimation when dealing with limited data. Understand the benefits of stratification and repeated iterations to improve reliability.



Presentation Transcript


  1. Data Mining, CSCI 307, Spring 2019, Lecture 27: Confidence Intervals; Hold-Out and Leave-One-Out Cross-Validation

  2. 5.3 What if the Amount of Data is Limited? The holdout method reserves a certain amount of the data for testing and uses the remainder for training • Often: one third for testing, the rest for training • Problem: the samples might not be representative • e.g., a class might be entirely missing from the test data • An advanced version uses stratification • Ensures that each class is represented in approximately equal proportions in both subsets
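A minimal holdout split can be sketched in plain Python (the function name `holdout_split` and the 30-instance toy data are invented for this illustration, not taken from the lecture):

```python
import random

def holdout_split(data, test_fraction=1/3, seed=0):
    """Shuffle a copy of the data and reserve a fraction of it for testing."""
    rng = random.Random(seed)
    shuffled = data[:]                      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

# 30 labeled instances as (feature, class) pairs
data = [(i, i % 2) for i in range(30)]
train, test = holdout_split(data)
print(len(train), len(test))   # 20 10
```

Note that this plain random split is exactly where the representativeness problem arises: nothing stops one class from being under-represented, or even absent, in the test third.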

  3. Stratified Sampling • Partition the data set by class, and draw samples from each partition proportionally, i.e., approximately the same percentage of the data from each class (Diagram: raw data partitioned into a stratified sample)
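A sketch of that stratified split, assuming the partitions are the classes (the function name `stratified_split` and the 2:1 toy data are invented here):

```python
import random
from collections import defaultdict

def stratified_split(data, test_fraction=1/3, seed=0):
    """Split each class's partition separately so that both subsets
    keep approximately the class proportions of the raw data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in data:                       # partition by class label
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():         # sample each partition proportionally
        rng.shuffle(items)
        n_test = int(round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# two classes in a 2:1 ratio; the split preserves that ratio in both subsets
data = [(i, 'a') for i in range(18)] + [(i, 'b') for i in range(9)]
train, test = stratified_split(data)
```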

  4. Repeated Holdout Method • The holdout estimate can be made more reliable by repeating the process with different subsamples • In each iteration, a certain proportion is randomly selected for training (possibly with stratification) • The error rates from the different iterations are averaged to yield an overall error rate • This is called the repeated holdout method • Still not optimal: the test sets of the different iterations overlap • Can we prevent overlapping?
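The repeated holdout procedure can be sketched as follows; to keep the sketch self-contained it "trains" a trivial majority-class predictor, and the function name is invented for this example:

```python
import random
from collections import Counter

def repeated_holdout_error(data, iterations=10, test_fraction=1/3, seed=0):
    """Average the test error of a majority-class predictor
    over several independent random holdout splits."""
    rng = random.Random(seed)
    error_rates = []
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(round(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        # "train": always predict the majority class of the training set
        majority = Counter(y for _, y in train).most_common(1)[0][0]
        errors = sum(1 for _, y in test if y != majority)
        error_rates.append(errors / len(test))
    return sum(error_rates) / len(error_rates)   # overall error estimate

data = [(i, 'yes') for i in range(40)] + [(i, 'no') for i in range(20)]
estimate = repeated_holdout_error(data)
```

Because each iteration reshuffles the whole data set, instances can land in the test set of several iterations, which is the overlap the slide objects to.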

  5. Cross-Validation The practical method of choice in limited-data situations • Cross-validation avoids overlapping test sets • First step: split the data into k subsets of equal size • Second step: use each subset in turn for testing and the remainder for training • Called k-fold cross-validation • The error estimates are averaged to yield an overall error estimate
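The two steps above can be sketched as an index generator (a minimal illustration; the function name is invented, and a real run would shuffle the instances first):

```python
def kfold_indices(n, k):
    """Yield (test_indices, train_indices) for each of k folds over
    n instances; every instance appears in exactly one test set."""
    # spread the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # step two: each fold is the test set once, the rest train
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield test_idx, train_idx

splits = list(kfold_indices(10, 3))   # 3 folds of sizes 4, 3, 3
```

Since the folds are disjoint, the test sets cannot overlap, which is exactly what repeated holdout could not guarantee.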

  6. Cross-Validation continued • Standard method for evaluation: Stratified ten-fold cross-validation • Why ten? • Extensive experiments have shown that this is the best choice to get an accurate estimate • Also some theoretical evidence • Stratification reduces the estimate’s variance • Even better: repeated stratified cross-validation • e.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
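Repeated stratified fold assignment can be sketched like this; dealing each class's instances out round-robin is one simple way to stratify, and the function name is invented for this example:

```python
import random
from collections import defaultdict

def repeated_stratified_folds(labels, k=10, repeats=10, seed=0):
    """For each repeat, assign every instance index to one of k folds,
    dealing each class out round-robin after a fresh shuffle so that
    every fold gets roughly the same class distribution."""
    rng = random.Random(seed)
    runs = []
    for _ in range(repeats):
        by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            by_class[y].append(idx)
        fold_of = {}
        for idxs in by_class.values():
            rng.shuffle(idxs)               # new random folds on every repeat
            for i, idx in enumerate(idxs):
                fold_of[idx] = i % k        # round-robin keeps folds stratified
        runs.append(fold_of)
    return runs

runs = repeated_stratified_folds(['a'] * 20 + ['b'] * 10, k=10, repeats=5)
```

Averaging the error estimates across all the repeats is what reduces the variance of the final estimate.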

  7. Leave-One-Out Cross-Validation Leave-one-out is a form of cross-validation • Set the number of folds to the number of training instances • i.e., for n training instances, build the classifier n times • Makes the best use of the data • Involves no random subsampling • Computationally expensive • Cannot be stratified
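Setting the number of folds to the number of instances makes the fold generator trivial (a minimal sketch; the function name is invented here):

```python
def leave_one_out(data):
    """Yield (held_out_instance, training_set) pairs: n folds for
    n instances, deterministically, with no random subsampling."""
    for i in range(len(data)):
        yield data[i], data[:i] + data[i + 1:]

# four instances => the classifier would be built four times
pairs = list(leave_one_out(['a', 'b', 'c', 'd']))
```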

  8. L-O-O CV and Stratification • Disadvantage: stratification is not possible • It guarantees a non-stratified sample because there is only one instance in the test set • Extreme (and artificial) example: a random dataset split equally between two classes • The best inducer predicts the majority class, giving a true error rate of 50% • BUT with leave-one-out, removing the test instance always leaves its class in the minority, so the majority-class predictor picks the opposite class every time and the estimated error is 100%
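This pathological example can be checked directly: on a perfectly balanced two-class dataset, a majority-class predictor errs on every single leave-one-out fold (a small self-contained demonstration, not code from the lecture):

```python
from collections import Counter

# 50 instances of each class: with no real signal, the best an inducer
# can do is predict the majority class, for a true error rate of 50%
data = ['yes'] * 50 + ['no'] * 50

errors = 0
for i in range(len(data)):
    train = data[:i] + data[i + 1:]
    # removing the test instance leaves its own class in the minority,
    # so the majority of the training set is always the opposite class
    majority = Counter(train).most_common(1)[0][0]
    if majority != data[i]:
        errors += 1

print(errors / len(data))   # 1.0 -- leave-one-out estimates 100% error
```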

  9. Summary: Holdout & Cross-Validation Methods Holdout method • Data is randomly partitioned into two independent sets • Training set (e.g., 2/3) for model construction • Test set (e.g., 1/3) for accuracy estimation • Random sampling: a variation of holdout • Repeat holdout k times; accuracy = average of the accuracies obtained Cross-validation (k-fold, where k = 10 is most popular) • Randomly partition the data into k mutually exclusive subsets, each of approximately equal size • At the i-th iteration, use Di as the test set and the others as the training set • Leave-one-out: k folds where k = number of instances, for small data sets • Stratified cross-validation: folds are stratified so that the class distribution in each fold approximates that of the initial data
