
Machine Learning in Practice Lecture 26

Learn about optimization techniques in machine learning, including tuning parameter settings and finding optimal models, and why different views on the same data can be useful.


Presentation Transcript


  1. Machine Learning in Practice, Lecture 26 Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

  2. Plan for the day • Announcements • Questions? • Readings for next 3 lectures on Blackboard • Mid-term Review

  3. Locally Optimal Solutions http://biology.st-andrews.ac.uk/vannesmithlab/simanneal.png

  4. What do we learn from this? • No algorithm is guaranteed to find the globally optimal solution • Some algorithms, or variations on algorithms, may do better on one data set just because of where in the space they started • Instability can be exploited • Noise can put you in a different starting place • Different views on the same data are useful • When you tune, you need to carefully avoid overfitting to flukes in your data

  5. Optimization

  6. Optimizing Parameter Settings [Diagram: the data divided into folds 1–5, with Train, Validation, and Test portions] • This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together • If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation

  7. Overview of Optimization • Stage 1: Estimate Tuned Performance • On each fold, test all versions of algorithm over training data to find optimal one for that fold • Train model with optimal setting over training data • Apply that model to the testing data for that fold • Do for all folds and average across folds • Stage 2: Find Optimal settings over whole set • Test each version of the algorithm using cross-validation over the whole set • Pick the one that works the best • But ignore the performance value you get! • Stage 3: Train Optimal Model over whole set

  8. Overview of Optimization • Stage 1: Estimate Tuned Performance • On each fold, test all versions of algorithm over training data to find optimal one for that fold • Train model with optimal setting over training data • Apply that model to the testing data for that fold • Do for all folds and average across folds • Stage 1 tells you how well the optimized model you will train in Stage 3 over the whole set will do on a new data set

  9. Overview of Optimization • Stage 2: Find Optimal settings over whole set • Test each version of the algorithm using cross-validation over the whole set • Pick the one that works the best • But ignore the performance value you get! • Stage 3: Train Optimal Model over whole set • The result of stage 3 is the trained, optimized model that you will use!!!

  10. Optimization in Weka • Divide your data into 10 train/test pairs • Tune parameters using cross-validation on the training set (this is the inner loop) • Use those optimized settings on the corresponding test set • Note that you may have a different set of parameter settings for each of the 10 train/test pairs • You can do the optimization in the Experimenter
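A minimal sketch of this inner-loop tuning procedure using Weka's Java API rather than the Experimenter GUI. The data file name (mydata.arff), the class name, and the confidence-factor grid (taken from slide 13 below) are illustrative assumptions; only the Weka classes themselves are real.

```java
// Nested tuning sketch: outer 10-fold loop estimates tuned performance (Stage 1);
// the inner cross-validation on each training fold picks a setting for that fold.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedCVSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // hypothetical data file
    data.setClassIndex(data.numAttributes() - 1);
    data.randomize(new Random(1));
    data.stratify(10);

    double[] grid = {0.1, 0.25, 0.5, 0.75};            // candidate -C values (from slide 13)
    double sumAcc = 0;

    for (int fold = 0; fold < 10; fold++) {            // outer loop over train/test pairs
      Instances train = data.trainCV(10, fold);
      Instances test  = data.testCV(10, fold);

      // Inner loop: pick the best confidence factor by cross-validation on the training fold only.
      double bestC = grid[0], bestAcc = -1;
      for (double c : grid) {
        J48 j48 = new J48();
        j48.setConfidenceFactor((float) c);
        Evaluation inner = new Evaluation(train);
        inner.crossValidateModel(j48, train, 10, new Random(1));
        if (inner.pctCorrect() > bestAcc) { bestAcc = inner.pctCorrect(); bestC = c; }
      }

      // Train with the winning setting on the whole training fold; apply to the held-out test fold.
      J48 tuned = new J48();
      tuned.setConfidenceFactor((float) bestC);
      tuned.buildClassifier(train);
      Evaluation outer = new Evaluation(train);
      outer.evaluateModel(tuned, test);
      sumAcc += outer.pctCorrect();
    }
    System.out.println("Estimated tuned accuracy: " + (sumAcc / 10) + "%");
  }
}
```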

  11. Train/Test Pairs * Use the StratifiedRemoveFolds filter
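For reference, a hedged sketch of building one stratified train/test pair programmatically with the StratifiedRemoveFolds filter named on the slide. The file name, fold number, and helper method are hypothetical; check the filter's documentation for the exact semantics of the invert option.

```java
// One train/test pair via StratifiedRemoveFolds: by default the filter outputs the
// selected fold; inverting the selection outputs everything *except* that fold.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class MakeFoldPair {
  static Instances foldSubset(Instances data, int fold, boolean trainingPart) throws Exception {
    StratifiedRemoveFolds filter = new StratifiedRemoveFolds();
    filter.setNumFolds(10);                 // 10 train/test pairs overall
    filter.setFold(fold);                   // which fold to select (1-based)
    filter.setInvertSelection(trainingPart); // true = the other nine folds (training part)
    filter.setInputFormat(data);
    return Filter.useFilter(data, filter);
  }

  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // hypothetical data file
    data.setClassIndex(data.numAttributes() - 1);
    Instances test1  = foldSubset(data, 1, false);  // fold 1 as Test1
    Instances train1 = foldSubset(data, 1, true);   // the remaining folds as Train1
    System.out.println("Train1: " + train1.numInstances() + ", Test1: " + test1.numInstances());
  }
}
```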

  12. Setting Up for Optimization * Prepare to save the results • Load in training sets for all folds • We’ll use cross-validation within training folds to do the optimization

  13. What are we optimizing? Let’s optimize the confidence factor. Let’s try .1, .25, .5, and .75
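A small sketch of the comparison the Experimenter performs here: cross-validate J48 at each of the four candidate confidence factors on one training fold and report accuracy. The file name train1.arff is a placeholder for one of the training folds created above.

```java
// Compare J48 at each candidate -C value using 10-fold cross-validation.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConfidenceFactorGrid {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train1.arff");  // hypothetical training fold
    train.setClassIndex(train.numAttributes() - 1);
    for (double c : new double[] {0.1, 0.25, 0.5, 0.75}) {
      J48 j48 = new J48();
      j48.setConfidenceFactor((float) c);
      Evaluation eval = new Evaluation(train);
      eval.crossValidateModel(j48, train, 10, new Random(1));
      System.out.printf("-C %.2f : %.2f%% correct%n", c, eval.pctCorrect());
    }
  }
}
```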

  14. Add Each Algorithm to Experimenter Interface

  15. Look at the Results * Note that the optimal setting varies across folds.

  16. Apply the optimized settings on each fold * Performance on Test1 using optimized settings from Train1

  17. Using CVParameterSelection

  18. Using CVParameterSelection

  19. Using CVParameterSelection You have to know what the command-line options look like. You can find them online or in the Experimenter. Don't forget to click Add!

  20. Using CVParameterSelection Best setting over whole set

  21. Using CVParameterSelection * Tuned performance.
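The same tuning can also be scripted; a hedged sketch follows. CVParameterSelection takes an evenly spaced grid in the form "<option> <min> <max> <steps>", so it can only approximate the .1/.25/.5/.75 values tried earlier; the data file name is a placeholder.

```java
// Programmatic parameter tuning with CVParameterSelection wrapping J48.
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class CVParamSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // hypothetical data file
    data.setClassIndex(data.numAttributes() - 1);

    CVParameterSelection ps = new CVParameterSelection();
    ps.setClassifier(new J48());
    ps.addCVParameter("C 0.1 0.75 4");   // try 4 evenly spaced -C values in [0.1, 0.75]
    ps.buildClassifier(data);            // internal cross-validation picks the best setting

    System.out.println("Best options: " + Utils.joinOptions(ps.getBestClassifierOptions()));
  }
}
```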

  22. Non-linearity in Support Vector Machines

  23. Maximum Margin Hyperplanes Convex Hull • The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two convex hulls.

  24. Maximum Margin Hyperplanes Support Vectors Convex Hull • The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two convex hulls. • Note that the maximum margin hyperplane depends only on the support vectors, which should be relatively few in comparison with the total set of data points.
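The slides do not show the formula, but the standard dual form of the SVM decision function makes the dependence on the support vectors explicit (K is the kernel; in the linear case K(x_i, x) = x_i · x):

```latex
f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b \Big),
\qquad \alpha_i > 0 \ \text{only for the support vectors.}
```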

  25. “The Kernel Trick” If your data is not linearly separable • Note that “the kernel trick” can be applied to other algorithms, like perceptron learners, but they will not necessarily learn the maximum margin hyperplane.

  26. An example of a polynomial kernel function
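A minimal sketch of training Weka's SMO learner with a polynomial kernel. The data file is a placeholder; note that by default Weka's PolyKernel computes K(x, y) = (x · y)^p, with an option for the lower-order form (x · y + 1)^p.

```java
// SVM with a polynomial kernel in Weka: exponent 1 is linear; higher exponents
// give a non-linear decision boundary in the original feature space.
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PolyKernelSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // hypothetical data file
    data.setClassIndex(data.numAttributes() - 1);

    PolyKernel kernel = new PolyKernel();
    kernel.setExponent(2.0);   // degree-2 polynomial kernel

    SMO smo = new SMO();
    smo.setKernel(kernel);
    smo.buildClassifier(data);
    System.out.println(smo);
  }
}
```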

  27. Thought Question! What is the connection between the meta-features we have been talking about under feature space design and kernel functions?

  28. Remember: Use just as much power as you need, and no more

  29. Similarity

  30. What does it mean for two vectors to be similar?

  31. What does it mean for two vectors to be similar? • Euclidean distance: if there are k attributes, the distance is sqrt((a1 – b1)^2 + (a2 – b2)^2 + … + (ak – bk)^2) • For nominal attributes, the difference is 0 when the values are the same and 1 otherwise • A common policy for missing values is that if either or both of the values being compared are missing, they are treated as different

  32. What does it mean for two vectors to be similar? • Cosine similarity = Dot(A,B) / (Len(A) Len(B)) = (a1 b1 + a2 b2 + … + an bn) / (sqrt(a1^2 + a2^2 + … + an^2) · sqrt(b1^2 + b2^2 + … + bn^2))
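A small, self-contained sketch of both measures for plain numeric vectors. The vectors in main are made up, but they reproduce the kind of disagreement illustrated on slide 33: cosine similarity prefers b (same direction as a), while Euclidean distance prefers c (closer in absolute terms).

```java
// Euclidean distance vs. cosine similarity for numeric vectors of equal length
// (no nominal attributes or missing-value handling here).
public class VectorSimilarity {
  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static double cosine(double[] a, double[] b) {
    double dot = 0, lenA = 0, lenB = 0;
    for (int i = 0; i < a.length; i++) {
      dot  += a[i] * b[i];
      lenA += a[i] * a[i];
      lenB += b[i] * b[i];
    }
    return dot / (Math.sqrt(lenA) * Math.sqrt(lenB));
  }

  public static void main(String[] args) {
    double[] a = {1.0, 2.0};
    double[] b = {2.0, 4.0};   // same direction as a, but farther away
    double[] c = {1.5, 1.5};   // different direction, but nearby
    System.out.println("cos(a,b)=" + cosine(a, b) + "  cos(a,c)=" + cosine(a, c));
    System.out.println("euc(a,b)=" + euclidean(a, b) + "  euc(a,c)=" + euclidean(a, c));
  }
}
```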

  33. What does it mean for two vectors to be similar? • Cosine similarity rates B and A as more similar than C and A • Euclidean distance rates C and A closer than B and A [Figure: vectors A, B, and C plotted in the feature space]

  34. What does it mean for two vectors to be similar? • Cosine similarity rates B and A as more similar than C and A • Euclidean distance also rates B and A closer than C and A [Figure: a different arrangement of vectors A, B, and C]

  35. Remember! Different similarity metrics will lead to a different grouping of your instances! Think in terms of neighborhoods of instances…

  36. Feature Selection

  37. Why do irrelevant features hurt performance? • Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it’s easy for the classifier to get confused • Naïve Bayes does not have this problem, but it has other problems, as we have discussed • SVM is relatively good at ignoring irrelevant attributes, but it can still suffer • Also, it’s very computationally expensive with large attribute spaces
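Not from the lecture, but one common Weka remedy for large, noisy attribute spaces is to wrap the learner in AttributeSelectedClassifier, so that attribute selection is redone inside each cross-validation fold. The evaluator, search method, number of attributes, base learner, and file name below are all illustrative assumptions.

```java
// Attribute selection wrapped around a learner, evaluated by cross-validation.
import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectionSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // hypothetical data file
    data.setClassIndex(data.numAttributes() - 1);

    Ranker ranker = new Ranker();
    ranker.setNumToSelect(10);   // keep the 10 highest-information-gain attributes (arbitrary)

    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setEvaluator(new InfoGainAttributeEval());
    asc.setSearch(ranker);
    asc.setClassifier(new NaiveBayes());

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(asc, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}
```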

  38. Take Home Message • Good Luck!
