
Lab 1



  1. Lab 1 Getting started with Basic Learning Machines and the Overfitting Problem

  2. Lab 1 Polynomial regression

  3. Matlab: POLY_GUI • The code implements the ridge regression algorithm: w = argmin_w Σ_i (1 - y_i f(x_i))² + γ ||w||², with f(x) = w_1 x + w_2 x² + … + w_n x^n = w x^T, where x = [x, x², …, x^n]. The solution is w^T = X⁺ Y, with X⁺ = X^T (X X^T + γ I)^(-1) = (X^T X + γ I)^(-1) X^T and X = [x(1); x(2); …; x(p)] (a matrix of size (p, n)). • The leave-one-out (LOO) error is obtained with the PRESS statistic (Predicted REsidual Sum of Squares): LOO error = (1/p) Σ_k [ r_k / (1 - (X X⁺)_kk) ]²
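
To make the algebra above concrete, here is a minimal plain-Matlab sketch of the same computation (hypothetical variable names; it assumes a design matrix X of size (p, n) built from the monomials and a target vector Y of size (p, 1), and is not the POLY_GUI code itself):

    gamma = 0.1;                           % shrinkage (regularization) parameter
    [p, n] = size(X);
    Xplus = (X'*X + gamma*eye(n)) \ X';    % X+ = (X'X + gamma*I)^-1 X'
    w = (Xplus * Y)';                      % ridge weights, f(x) = w*x'
    H = X * Xplus;                         % "hat" matrix X*X+
    r = Y - X*w';                          % training residuals r_k
    loo = (1/p) * sum((r ./ (1 - diag(H))).^2);   % PRESS leave-one-out error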

  4. Matlab: POLY_GUI

  5. Matlab: POLY_GUI • At the prompt type: poly_gui; • Vary the parameters. Refrain from hitting “CV”. Explain what happens in the following situations: • Sample num. << Target degree (small noise) • Large noise, small sample num • Target degree << Model degree • Why is the LOO error sometimes larger than the training and test error? • Are there local minima in the LOO error? Is the LOO error flat near the optimum? • Propose ways of getting a better solution.

  6. CLOP Data Objects The poly_gui emulates CLOP objects of type “data”: • X = rand(10,5) • Y = rand(10,1) • D = data(X,Y) % constructor • methods(D) • get_x(D) • get_y(D) • plot(D);

  7. CLOP Model Objects poly_ridge is a “model” object. • P = poly_ridge; h = plot(P); • D = gene(P); plot(D, h); • [resu, P] = train(P, D); • mse(resu) • Dt = gene(P); • [tresu, P] = test(P, Dt); • mse(tresu) • plot(P, h);

  8. Lab 1 Support Vector Machines

  9. Support Vector Classifier (Boser, Guyon, Vapnik, 1992) • Decision function: f(x) = Σ_{k ∈ SV} α_k y_k k(x, x_k), with x = [x1, x2]. • The decision boundary f(x) = 0 separates the region f(x) > 0 from the region f(x) < 0 (figure: decision boundary and support vectors in the (x1, x2) plane).

  10. Matlab: SVC_GUI • At the prompt type: svc_gui; • The code implements the Support Vector Machine algorithm with kernel k(s, t) = (1 + s·t)^q exp(-γ ||s - t||²) • The regularization is similar to ridge regression: hinge loss L(x_i) = max(0, 1 - y_i f(x_i))^b, empirical risk Σ_i L(x_i), and w = argmin (1/C) ||w||² + Σ_i L(x_i), where the term (1/C) ||w||² plays the role of the shrinkage.
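
As an illustration of the objective above, here is a plain-Matlab sketch evaluating the regularized hinge-loss risk for a linear model f(x) = w*x' (hypothetical variable names; it assumes X of size (p, n) and labels Y in {-1, +1}, and is not what SVC_GUI runs internally):

    b = 1;  C = 10;                        % hinge exponent and cost hyperparameter
    f = X * w';                            % predictions f(x_i)
    L = max(0, 1 - Y .* f).^b;             % hinge loss per example
    risk = (1/C) * (w * w') + sum(L);      % (1/C)||w||^2 + sum_i L(x_i)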

  11. Lab 1 More loss functions…

  12. Loss Functions L(y, f(x)) • All losses are expressed as functions of the margin z = y f(x): z < 0 means misclassified, z > 0 well classified. • 0/1 loss; Perceptron loss max(0, -z); SVC (hinge) loss, b=1: max(0, 1 - z); SVC loss, b=2: max(0, 1 - z)²; square loss (1 - z)²; logistic loss log(1 + e^(-z)); Adaboost loss e^(-z). (Figure: the losses plotted against z, with the decision boundary at z = 0 and the margin at z = 1.)
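
These losses are easiest to compare by plotting them against the margin z; a short plain-Matlab sketch:

    z = linspace(-3, 3, 200);
    plot(z, double(z < 0), ...             % 0/1 loss
         z, max(0, -z), ...                % Perceptron loss
         z, max(0, 1 - z), ...             % SVC (hinge) loss, b = 1
         z, max(0, 1 - z).^2, ...          % SVC loss, b = 2
         z, (1 - z).^2, ...                % square loss
         z, log(1 + exp(-z)), ...          % logistic loss
         z, exp(-z));                      % Adaboost loss
    xlabel('z = y f(x)'); ylabel('L(z)');
    legend('0/1', 'perceptron', 'hinge', 'hinge^2', 'square', 'logistic', 'Adaboost');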

  13. Exercise: Gradient Descent • Linear discriminant f(x) = Σ_j w_j x_j • Functional margin z = y f(x), y = ±1 • Compute ∂z/∂w_j • Derive the learning rules Δw_j = -η ∂L/∂w_j corresponding to the following loss functions: SVC (hinge) loss max(0, 1 - z); Adaboost loss e^(-z); square loss (1 - z)²; logistic loss log(1 + e^(-z)); Perceptron loss max(0, -z)
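
As an illustration of the kind of derivation the exercise asks for (one loss only, so as not to give the whole answer away): for the square loss L = (1 - z)² with z = y w*x', we have ∂L/∂w_j = -2 (1 - z) y x_j, hence Δw_j = -η ∂L/∂w_j = 2 η (1 - z) y x_j. A stochastic-update sketch in plain Matlab (hypothetical variable names):

    eta = 0.01;                            % learning rate
    for epoch = 1:100
        for i = randperm(size(X, 1))       % visit training examples in random order
            z = Y(i) * (X(i,:) * w');      % functional margin of example i
            w = w + eta * 2 * (1 - z) * Y(i) * X(i,:);   % Delta w = -eta dL/dw
        end
    end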

  14. Exercise: Dual Algorithms • From the Δw_j, derive the update Δw of the whole weight vector, using w = Σ_i α_i x_i • From the Δw, derive the updates Δα_i of the dual algorithms.

  15. Summary • Modern ML algorithms optimize a penalized risk functional of the form Σ_i L(y_i, f(x_i)) + γ ||w||² (an empirical risk term plus a regularization term), of which the ridge regression and SVC objectives above are two instances.

  16. Lab 2 Getting started with CLOP

  17. Lab 2 CLOP tutorial

  18. What is CLOP? • CLOP = Challenge Learning Object Package. • Based on the Spider package developed at the Max Planck Institute. • Two basic abstractions: • Data object • Model object • Put the CLOP directory in your path. • At the prompt type: use_spider_clop; • If you have used poly_gui before, type: clear classes

  19. CLOP Data Objects At the Matlab prompt: • addpath(<clop_dir>); • use_spider_clop; • X=rand(10,8); • Y=[1 1 1 1 1 -1 -1 -1 -1 -1]'; • D=data(X,Y); % constructor • [p,n]=get_dim(D) • get_x(D) • get_y(D)

  20. CLOP Model Objects D is a data object previously defined. • model = kridge; % constructor • [resu, model] = train(model, D); • resu, model.W, model.b0 • Yhat = D.X*model.W' + model.b0 • testD = data(rand(3,8), [-1 -1 1]'); • tresu = test(model, testD); • balanced_errate(tresu.X, tresu.Y)

  21. Hyperparameters and Chains A model often has hyperparameters: • default(kridge) • hyper = {'degree=3', 'shrinkage=0.1'}; • model = kridge(hyper); Models can be chained: • model = chain({standardize, kridge(hyper)}); • [resu, model] = train(model, D); • tresu = test(model, testD); • balanced_errate(tresu.X, tresu.Y)

  22. Hyper-parameters • Kernel methods (kridge and svc): k(x, y) = (coef0 + x·y)^degree exp(-gamma ||x - y||²), with k_ij = k(x_i, x_j) and k_ii ← k_ii + shrinkage • Naïve Bayes (naive): none • Neural network (neural): units, shrinkage, maxiter • Random Forest (rf, Windows only): mtry
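
A plain-Matlab sketch of how a kernel matrix of this form can be built (hypothetical hyperparameter values; the actual kridge/svc implementations may differ in detail):

    coef0 = 1; degree = 3; gamma = 0.1; shrinkage = 0.01;
    sq = sum(X.^2, 2);                                    % squared norms of the rows of X
    D2 = bsxfun(@plus, sq, sq') - 2*(X*X');               % squared pairwise distances
    K  = (coef0 + X*X').^degree .* exp(-gamma * D2);      % k_ij = k(x_i, x_j)
    K  = K + shrinkage * eye(size(K, 1));                 % k_ii <- k_ii + shrinkage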

  23. Exercise • Here are some of the pattern recognition CLOP objects: @rf @naive @svc @neural @gentleboost @lssvm @gkridge @kridge @klogistic @logitboost • Try at the prompt: example(neural) • Try other pattern recognition objects • Try different sets of hyperparameters, e.g., example(svc({'gamma=1', 'shrinkage=0.001'})) • Remember: use default(method) to get the HP.

  24. Lab 2 Example: Digit Recognition Subset of the MNIST data of LeCun and Cortes used for the NIPS2003 challenge

  25. data(X, Y) % Go to the Gisette directory: • cd('GISETTE') % Load “validation” data: • Xt=load('gisette_valid.data'); • Yt=load('gisette_valid.labels'); % Create a data object % and examine it: • Dt=data(Xt, Yt); • browse(Dt, 2); % Load “training” data (longer): • X=load('gisette_train.data'); • Y=load('gisette_train.labels'); • [p, n]=get_dim(Dt); • D=train(subsample(['p_max=' num2str(p)]), data(X, Y)); • clear X Y Xt Yt % Save for later use: • save('gisette', 'D', 'Dt');

  26. model(hyperparam) % Define some hyperparameters: • hyper = {'degree=3', 'shrinkage=0.1'}; % Create a kernel ridge % regression model: • model = kridge(hyper); % Train it and test it: • [resu, Model] = train(model, D); • tresu = test(Model, Dt); % Visualize the results: • roc(tresu); • idx=find(tresu.X.*tresu.Y<0); • browse(get(D, idx), 2);

  27. Exercise • Here are some pattern recognition CLOP objects: @rf @naive @gentleboost @svc @neural @logitboost @kridge @lssvm @klogistic • Instantiate a model with some hyperparameters (use default(method) to get the HP) • Vary the HP and the number of training examples (Hint: use get(D, 1:n) to restrict the data to the first n examples).

  28. chain({model1, model2, …}) and ensemble({model1, model2, …}) % Combine preprocessing and kernel ridge regression: • my_prepro = normalize; • model = chain({my_prepro, kridge(hyper)}); % Combine replicas of a base learner into an ensemble: • for k=1:10 • base_model{k} = neural; • end • model = ensemble(base_model);

  29. Exercise • Here are some preprocessing CLOP objects: @normalize @standardize @fourier • Chain a preprocessing and a model, e.g., • model=chain({fourier, kridge('degree=3')}); • my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'}); • model=chain({normalize, my_classif}); • Train, test, visualize the results. Hint: you can browse the preprocessed data: • browse(train(standardize, D), 2);

  30. Summary % After creating your complex model, just one command: train • model = ensemble({chain({standardize, kridge(hyper)}), chain({normalize, naive})}); • [resu, Model] = train(model, D); % After training your complex model, just one command: test • tresu = test(Model, Dt); % You can use a “cv” object to perform cross-validation: • cv_model = cv(model); • [resu, Model] = train(cv_model, D); • roc(resu);

  31. Lab 3 Getting started with Feature Selection

  32. POLY_GUI again… • clear classes • poly_gui; • Check the “Multiplicative updates” (MU) box. • Play with the parameters. • Try CV • Compare with no MU

  33. Lab 3 Exploring feature selection methods

  34. Re-load the GISETTE data % Start CLOP: • clear classes • use_spider_clop; % Go to the Gisette directory: • cd('GISETTE') • load('gisette');

  35. Visualization 1) Create a heatmap of the data matrix or a subset: show(D); show(get(D, 1:10, 1:2:500)); 2) Look at individual patterns: browse(D); browse(D, 2); % For 2d data % Display feature positions: browse(D, 2, [212, 463, 429, 239]); 3) Make a scatter plot of a few features: scatter(D, [212, 463, 429, 239]);

  36. Example • my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'}); • model=chain({normalize, s2n('f_max=100'), my_classif}); • [resu, Model] = train(model, D); • tresu = test(Model, Dt); • roc(tresu); % Show the misclassified first • [s,idx]=sort(tresu.X.*tresu.Y); • browse(get(Dt, idx), 2, Model{2});

  37. Some Filters in CLOP Univariate: • @s2n (signal-to-noise ratio) • @Ttest (T statistic; similar to s2n) • @Pearson (uses Matlab corrcoef; gives the same results as Ttest when the classes are balanced) • @aucfs (ranksum test) Multivariate: • @relief (no elimination of redundancy) • @gs (Gram-Schmidt orthogonalization; selects complementary features)
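
For reference, a plain-Matlab sketch of the signal-to-noise criterion that @s2n ranks features by (assuming X of size (p, n) and labels Y in {-1, +1}; the CLOP object may handle details such as zero-variance features differently):

    mu_pos = mean(X(Y == 1, :));   mu_neg = mean(X(Y == -1, :));
    sd_pos = std(X(Y == 1, :));    sd_neg = std(X(Y == -1, :));
    s2n_score = abs(mu_pos - mu_neg) ./ (sd_pos + sd_neg);       % one score per feature
    [sorted_scores, feature_rank] = sort(s2n_score, 'descend');  % best features first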

  38. Exercise • Change the feature selection algorithm • Visualize the features • What can you say of the various methods? • Which one gives the best results for 2, 10, 100 features? • Can you improve by changing the preprocessing? (Hint: try @pc_extract)

  39. Lab 3 Feature significance

  40. T-test • Model: the class-conditional distributions P(X_i|Y=1) and P(X_i|Y=-1) are normal with means μ+ and μ- and equal variance σ², unknown and estimated from the data as σ²_within. • Null hypothesis H0: μ+ = μ- • T statistic: if H0 is true, t = (μ+ - μ-) / (σ_within √(1/m+ + 1/m-)) follows a Student distribution with (m+ + m- - 2) d.f., where m+ and m- are the numbers of examples in each class.
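
A plain-Matlab sketch of this computation for a single feature (variable names are hypothetical; tcdf requires the Statistics Toolbox):

    i = 1;                                       % feature index
    xp = X(Y == 1, i);   xn = X(Y == -1, i);     % values of feature i in each class
    mp = numel(xp);      mn = numel(xn);         % class sizes m+ and m-
    s_within = sqrt(((mp-1)*var(xp) + (mn-1)*var(xn)) / (mp + mn - 2));  % pooled std
    t = (mean(xp) - mean(xn)) / (s_within * sqrt(1/mp + 1/mn));          % T statistic
    pval = 2 * (1 - tcdf(abs(t), mp + mn - 2));                          % two-sided p-value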

  41. Evaluation of pval and FDR • Ttest object: • computes pval analytically • FDR ~ pval * nsc/n • probe object: • takes any feature ranking object as an argument (e.g., s2n, relief, Ttest) • pval ~ nsp/np • FDR ~ pval * nsc/n
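
The probe idea can be sketched in a few lines of plain Matlab (this is only an illustration, not the CLOP probe object; compute_score stands in for any hypothetical feature-ranking criterion such as s2n): append fake "probe" features obtained by permuting the rows of the real ones, rank everything together, and estimate the p-value at rank r as the fraction of probes ranked above r.

    n  = size(X, 2);                              % number of real features
    Xp = X(randperm(size(X, 1)), :);              % probes: real features with permuted rows
    score = compute_score([X Xp], Y);             % hypothetical criterion, one score per column
    [ignore, order] = sort(score, 'descend');
    is_probe = (order > n);                       % which ranked candidates are probes
    pval = cumsum(is_probe) / n;                  % pval ~ (num. selected probes) / (num. probes)
    pval_real = pval(~is_probe);                  % p-value estimate for each ranked real feature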

  42. Analytic vs. probe (Figure: FDR as a function of feature rank, comparing the analytic and the probe estimates.)

  43. Example • [resu, FS] = train(Ttest, D); • [resu, PFS] = train(probe(Ttest), D); • figure('Name', 'pvalue'); • plot(get_pval(FS, 1), 'r'); • hold on; plot(get_pval(PFS, 1)); • figure('Name', 'FDR'); • plot(get_fdr(FS, 1), 'r'); • hold on; plot(get_fdr(PFS, 1));

  44. Exercise • What could explain the differences between the pvalue and FDR obtained with the analytic and the probe method? • Replace Ttest with chain({rmconst('w_min=0'), Ttest}) • Recompute the pvalue and FDR curves. What do you notice? • Choose an optimum number fnum of features based on the pvalue or the FDR. Visualize with browse(D, 2, FS, fnum); • Create a model with fnum features. Is fnum optimal? Do you get something better with CV?

  45. Lab 3 Local feature selection

  46. Exercise Consider the 1-nearest-neighbor algorithm. We define the following score, where s(k) (resp. d(k)) is the index of the nearest neighbor of x_k belonging to the same class as x_k (resp. to a different class).

  47. Exercise • Motivate the choice of such a cost function to approximate the generalization error (qualitative answer) • How would you derive an embedded method to perform feature selection for 1 nearest neighbor using this functional? • Motivate your choice (what makes your method an ‘embedded method’ and not a ‘wrapper’ method)

  48. Relief • Global score: Relief = ⟨D_miss / D_hit⟩, averaged over all patterns. • Local score: Local_Relief = D_miss / D_hit, where D_hit is the distance from a pattern to its nearest hit (nearest example of the same class) and D_miss the distance to its nearest miss (nearest example of a different class).
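
A plain-Matlab sketch of these two scores using Euclidean distances over all features (hypothetical variable names; the CLOP @relief object ranks individual features, which this global version does not do):

    p  = size(X, 1);
    sq = sum(X.^2, 2);
    D  = sqrt(max(bsxfun(@plus, sq, sq') - 2*(X*X'), 0));   % pairwise Euclidean distances
    D(1:p+1:end) = inf;                                     % ignore self-distances
    same = bsxfun(@eq, Y, Y');                              % same-class indicator matrix
    Dh = D;  Dh(~same) = inf;                               % keep same-class distances only
    Dm = D;  Dm(same)  = inf;                               % keep different-class distances only
    Dhit  = min(Dh, [], 2);                                 % distance to the nearest hit
    Dmiss = min(Dm, [], 2);                                 % distance to the nearest miss
    local_relief = Dmiss ./ Dhit;                           % per-pattern score
    relief_score = mean(local_relief);                      % global score <Dmiss/Dhit>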

  49. Exercise • [resu, FS] = train(relief, D); • browse(D, 2, FS, 20); • [resu, LFS] = train(local_relief, D); • browse(D, 2, LFS, 20); • Propose a modification of the nearest neighbor algorithm that uses features relevant to individual patterns (like those provided by “local_relief”). • Do you expect such an algorithm to perform better than the non-local version using “relief”?

  50. Epilogue Becoming a pro and playing with other datasets
