Model Selection Manu Chandran
Outline • Background and motivation • Overview of techniques • Cross-validation • Bootstrap method • Setting up the problem • Comparing AIC, BIC, cross-validation, and bootstrap • For a small data set - iris data set • For a large data set - ellipse data set • Finding the number of relevant parameters - cancer data set (from the class text) • Conclusion
Background and Motivation • Model selection • Parameters to change • Overview of error measures and when each is used • AIC -> suited to smaller data sets; penalizes complexity moderately • BIC -> suited to larger data sets; penalizes complexity more heavily, favoring simpler models • Cross-validation • Bootstrap methods
Motivation for Cross-Validation • Useful when the data set is small • Enables reuse of the data for both fitting and error estimation • Basic idea of cross-validation • K-fold cross-validation • K = 5 in this example (see the sketch below)
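A minimal sketch of 5-fold cross-validation on the iris data set mentioned in the outline; the classifier choice here is an assumption for illustration, not the one used in the slides.

```python
# 5-fold cross-validation sketch (assumes scikit-learn; model is illustrative)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])           # fit on the 4 training folds
    acc = model.score(X[test_idx], y[test_idx])     # evaluate on the held-out fold
    fold_errors.append(1.0 - acc)

print("CV error estimate:", np.mean(fold_errors))   # average over the 5 folds
```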
Simple enough! What more? • Points to consider • Why is it important? • How do we estimate the test error? • Selection of K • What K is good enough for a given data set? • Why it matters: the bias-variance trade-off • Selection of features in the "low data, high feature" problem • Important dos and don'ts of feature selection when using cross-validation • Finds application in bioinformatics, where the number of parameters often far exceeds the number of samples
Overview of Error Terms • Recap from last class • In-sample error: Err_in • Expected error: Err • Training error: err • True (conditional) error: Err_T • AIC and BIC attempt to estimate Err_in • Cross-validation attempts to estimate the expected error Err (a small AIC/BIC sketch follows)
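Since the slides rely on AIC and BIC as estimates of Err_in, here is a rough sketch for a Gaussian linear model; the formulas are stated up to additive constants and scaling conventions, and the function name is my own.

```python
# AIC/BIC sketch for a Gaussian linear model (up to additive constants);
# d counts the fitted coefficients, y_hat are in-sample predictions.
import numpy as np

def aic_bic(y, y_hat, d):
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    ll_term = n * np.log(rss / n)     # proportional to -2 * max log-likelihood
    aic = ll_term + 2 * d             # fixed penalty per parameter
    bic = ll_term + d * np.log(n)     # penalty grows with log(n), so BIC prefers simpler models for large n
    return aic, bic

# usage: among candidate models, pick the one with the smallest AIC or BIC
```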
Selection of K • K = N: N-fold CV, or leave-one-out • Approximately unbiased • High variance • K = 5: 5-fold CV • Lower variance • Higher bias • Subset size p denotes the best-fitting subset of p linear predictors (a comparison sketch follows)
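A sketch comparing the two choices of K discussed above; the data set and model are placeholders to illustrate the trade-off, not the ones from the slides.

```python
# Compare leave-one-out (K = N) and 5-fold CV error estimates
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())   # nearly unbiased, higher variance
k5_scores = cross_val_score(model, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))  # lower variance, more bias

print("LOOCV error estimate :", 1 - loo_scores.mean())
print("5-fold error estimate:", 1 - k5_scores.mean())
```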
Selection of Features Using CV • Often finds application in bioinformatics • One way of selecting predictors (screening step sketched below): • Screen for predictors that show high correlation with the class labels • Build a multivariate classifier from the screened predictors • Use CV to choose the tuning parameter • Estimate the prediction error of the final model
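A hypothetical helper for the screening step just described: rank predictors by absolute correlation with the labels and keep the top few. The function name and cutoff are my own.

```python
# Screening sketch: keep the n_keep predictors most correlated with the labels
import numpy as np

def screen_predictors(X, y, n_keep=50):
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(corrs))[:n_keep]   # column indices of the strongest predictors
```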
The Problem with This Method • CV is done after feature selection, so the test samples have already influenced which predictors were chosen • Right way to do cross-validation (sketched below): • Divide the samples into K cross-validation folds at random • Say K = 5 • Find the predictors using only the 4 training folds • Using these predictors, build and tune the classifier on those 4 folds • Test on the held-out 5th fold
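A sketch of the "right way": screening is repeated inside every fold, so the held-out samples never influence which predictors are kept. It reuses the hypothetical screen_predictors() helper above; the 1-nearest-neighbor classifier is an illustrative assumption.

```python
# Feature screening performed inside each CV fold, not before CV
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cv_error_with_screening(X, y, n_keep=50, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kf.split(X):
        keep = screen_predictors(X[train_idx], y[train_idx], n_keep)  # screen on training folds only
        clf = KNeighborsClassifier(n_neighbors=1)
        clf.fit(X[train_idx][:, keep], y[train_idx])
        acc = clf.score(X[test_idx][:, keep], y[test_idx])            # evaluate on the held-out fold
        errors.append(1.0 - acc)
    return np.mean(errors)
```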
Bootstrapping • Explanation of bootstrapping: draw B data sets of size N by sampling the training data with replacement, refit the model on each, and use the refits to assess the estimate
Probability of the i-th sample appearing in a bootstrap sample • For large N, the number of times a given point is drawn is approximately Poisson with λ = 1, so a point is absent from a bootstrap sample with probability about e^(-1) ≈ 0.368 • If a point contributes error 0 whenever it appears in the bootstrap sample (e.g., a 1-nearest-neighbor fit on a no-information problem) and 0.5 otherwise, the expected error is 0.5 * 0.368 = 0.184 • Far below the true value of 0.5 • To avoid this, the leave-one-out bootstrap is suggested
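A small sketch checking the 0.368/0.632 figure behind the calculation above: the analytic probability that a given point appears in a bootstrap sample, 1 - (1 - 1/N)^N, against a simulation. The sample sizes are arbitrary choices of mine.

```python
# Verify P(point i is in a bootstrap sample) ≈ 1 - 1/e ≈ 0.632,
# i.e. it is absent with probability ≈ 0.368
import numpy as np

N = 1000
analytic_in = 1 - (1 - 1 / N) ** N                 # -> 1 - e^(-1) for large N

rng = np.random.default_rng(0)
draws = rng.integers(0, N, size=(2000, N))         # 2000 bootstrap samples of size N
sim_in = np.mean([(d == 0).any() for d in draws])  # fraction containing point 0

print("analytic:", analytic_in, " simulated:", sim_in)
```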