Machine Learning in Practice, Lecture 26. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute
Plan for the day • Announcements • Questions? • Readings for next 3 lectures on Blackboard • Mid-term Review
Locally Optimal Solutions http://biology.st-andrews.ac.uk/vannesmithlab/simanneal.png
What do we learn from this? • No algorithm is guaranteed to find the globally optimal solution • Some algorithms, or variations on algorithms, may do better on one data set simply because of where in the space they started • Instability can be exploited • Noise can put you in a different starting place • Different views on the same data are useful • When you tune, you need to carefully avoid overfitting to flukes in your data
Optimizing Parameter Settings (figure: folds 1-5, each split into Train / Validation / Test) • This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together. • If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation
Overview of Optimization • Stage 1: Estimate Tuned Performance • On each fold, test all versions of algorithm over training data to find optimal one for that fold • Train model with optimal setting over training data • Apply that model to the testing data for that fold • Do for all folds and average across folds • Stage 2: Find Optimal settings over whole set • Test each version of the algorithm using cross-validation over the whole set • Pick the one that works the best • But ignore the performance value you get! • Stage 3: Train Optimal Model over whole set
Overview of Optimization • Stage 1: Estimate Tuned Performance • On each fold, test all versions of algorithm over training data to find optimal one for that fold • Train model with optimal setting over training data • Apply that model to the testing data for that fold • Do for all folds and average across folds • Stage 1 tells you how well the optimized model you will train in Stage 3 over the whole set will do on a new data set
Overview of Optimization • Stage 2: Find Optimal settings over whole set • Test each version of the algorithm using cross-validation over the whole set • Pick the one that works the best • But ignore the performance value you get! • Stage 3: Train Optimal Model over whole set • The result of stage 3 is the trained, optimized model that you will use!!!
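If you prefer to script this procedure rather than click through the GUI, here is a minimal sketch of Stage 1 (the outer/inner loop) using the Weka Java API. The file name mydata.arff, the choice of J48's confidence factor as the tuned parameter, and the candidate values are assumptions for illustration; the code also assumes the class is the last attribute.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedCV {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");     // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);
    data.randomize(new Random(1));
    data.stratify(10);

    float[] candidates = {0.1f, 0.25f, 0.5f, 0.75f};     // assumed candidate settings
    double sum = 0;

    for (int fold = 0; fold < 10; fold++) {              // Stage 1: outer loop over folds
      Instances train = data.trainCV(10, fold, new Random(1));
      Instances test  = data.testCV(10, fold);

      // Inner loop: cross-validate each candidate over the training folds only
      float best = candidates[0];
      double bestAcc = -1;
      for (float c : candidates) {
        J48 j48 = new J48();
        j48.setConfidenceFactor(c);
        Evaluation inner = new Evaluation(train);
        inner.crossValidateModel(j48, train, 10, new Random(1));
        if (inner.pctCorrect() > bestAcc) { bestAcc = inner.pctCorrect(); best = c; }
      }

      // Train with the winning setting and apply it to this fold's test data
      J48 tuned = new J48();
      tuned.setConfidenceFactor(best);
      tuned.buildClassifier(train);
      Evaluation outer = new Evaluation(train);
      outer.evaluateModel(tuned, test);
      sum += outer.pctCorrect();
    }
    // Average across folds = the Stage 1 estimate of tuned performance
    System.out.println("Estimated tuned accuracy: " + sum / 10);
  }
}
```

Stages 2 and 3 then repeat only the inner loop over the whole set and train the final model with the winning setting.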
Optimization in Weka • Divide your data into 10 train/test pairs • Tune parameters using cross-validation on the training set (this is the inner loop) • Use those optimized settings on the corresponding test set • Note that you may have a different set of parameter settings for each of the 10 train/test pairs • You can do the optimization in the Experimenter
Train/Test Pairs * Use the StratifiedRemoveFolds filter
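As a rough sketch of what the filter is doing, the same Train1/Test1 pair can also be produced from the Weka Java API; the fold number, seed, and file name below are placeholders.

```java
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class MakeFolds {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");     // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);        // class must be set for a supervised filter

    // Test1: keep only fold 1 of 10
    StratifiedRemoveFolds keep = new StratifiedRemoveFolds();
    keep.setOptions(Utils.splitOptions("-N 10 -F 1 -S 1"));
    keep.setInputFormat(data);
    Instances test1 = Filter.useFilter(data, keep);

    // Train1: everything except fold 1 (-V inverts the selection)
    StratifiedRemoveFolds drop = new StratifiedRemoveFolds();
    drop.setOptions(Utils.splitOptions("-N 10 -F 1 -S 1 -V"));
    drop.setInputFormat(data);
    Instances train1 = Filter.useFilter(data, drop);

    System.out.println("Train1: " + train1.numInstances() + " instances, Test1: " + test1.numInstances());
  }
}
```

Repeating this with -F 2 through -F 10 gives the other nine pairs.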
Setting Up for Optimization * Prepare to save the results • Load in training sets for all folds • We'll use cross-validation within training folds to do the optimization
What are we optimizing? Let's optimize the confidence factor, trying 0.1, 0.25, 0.5, and 0.75
Look at the Results * Note that the optimal setting varies across folds.
Apply the optimized settings on each fold * Performance on Test1 using optimized settings from Train1
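A minimal sketch of this apply step, assuming the fold files were saved as Train1.arff and Test1.arff and that 0.25 happened to win the inner tuning on Train1 (both the file names and the winning value are placeholders):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ApplyTunedSetting {
  public static void main(String[] args) throws Exception {
    Instances train1 = DataSource.read("Train1.arff");   // placeholder file names
    Instances test1  = DataSource.read("Test1.arff");
    train1.setClassIndex(train1.numAttributes() - 1);
    test1.setClassIndex(test1.numAttributes() - 1);

    J48 tuned = new J48();
    tuned.setConfidenceFactor(0.25f);                    // the value the inner loop picked for this fold
    tuned.buildClassifier(train1);

    Evaluation eval = new Evaluation(train1);
    eval.evaluateModel(tuned, test1);                    // performance on Test1
    System.out.println(eval.toSummaryString());
  }
}
```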
Using CVParameterSelection You have to know what the command-line options look like. You can find them online or in the Experimenter. Don't forget to click Add!
Using CVParameterSelection Best setting over whole set
Using CVParameterSelection * Tuned performance.
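The same wrapper is available from the Java API. A hedged sketch: the string "C 0.1 0.75 4" asks CVParameterSelection to sweep J48's -C option from 0.1 to 0.75 in four steps, roughly the values tried above; the file name is again a placeholder.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneWithCVParameterSelection {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");     // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    CVParameterSelection ps = new CVParameterSelection();
    ps.setClassifier(new J48());
    ps.addCVParameter("C 0.1 0.75 4");                   // sweep the confidence factor

    // Tuned performance: cross-validating the wrapper reruns the inner tuning on each training split
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(ps, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());

    // Best setting over the whole set, and the final model you would actually use
    ps.buildClassifier(data);
    System.out.println("Best options: " + Utils.joinOptions(ps.getBestClassifierOptions()));
  }
}
```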
Maximum Margin Hyperplanes (figure: the convex hull of each class) • The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two convex hulls.
Maximum Margin Hyperplanes (figure: support vectors and convex hulls) • The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two convex hulls. • Note that the maximum margin hyperplane depends only on the support vectors, which should be relatively few in comparison with the total set of data points.
“The Kernel Trick”: If your data is not linearly separable • Note that “the kernel trick” can be applied to other algorithms, like perceptron learners, but they will not necessarily learn the maximum margin hyperplane.
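One way to try this out in Weka is to give SMO a non-linear kernel. A minimal sketch, with the quadratic exponent and the file name as illustrative assumptions:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KernelTrickSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");     // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    SMO svm = new SMO();
    PolyKernel kernel = new PolyKernel();
    kernel.setExponent(2.0);          // quadratic kernel: implicit pairwise feature conjunctions
    svm.setKernel(kernel);

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(svm, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}
```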
Thought Question! What is the connection between the meta-features we have been talking about under feature space design and kernel functions?
What does it mean for two vectors to be similar? • Euclidean distance: if there are n attributes, distance = sqrt((a1 – b1)^2 + (a2 – b2)^2 + … + (an – bn)^2) • For nominal attributes, the difference is 0 when the values are the same and 1 otherwise • A common policy for missing values is that if either or both of the values being compared are missing, they are treated as different
What does it mean for two vectors to be similar? • Cosine similarity = Dot(A,B) / (Len(A) Len(B)) = (a1b1 + a2b2 + … + anbn) / (sqrt(a1^2 + a2^2 + … + an^2) sqrt(b1^2 + … + bn^2))
What does it mean for two vectors to be similar? (figure: one configuration of points A, B, C) • Here cosine similarity rates B and A as more similar than C and A • Euclidean distance rates C and A closer than B and A
What does it mean for two vectors to be similar? (figure: a second configuration of points A, B, C) • Here cosine similarity rates B and A as more similar than C and A • Euclidean distance also rates B and A closer than C and A
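A minimal sketch of both measures on hypothetical 2-D vectors chosen so that the two metrics disagree, as in the first picture above: B points in the same direction as A but is farther away, while C is nearby but points in a different direction.

```java
public class Similarity {
  // Euclidean distance: square root of the summed squared differences
  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum);
  }

  // Cosine similarity: dot product divided by the product of the lengths
  static double cosine(double[] a, double[] b) {
    double dot = 0, lenA = 0, lenB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      lenA += a[i] * a[i];
      lenB += b[i] * b[i];
    }
    return dot / (Math.sqrt(lenA) * Math.sqrt(lenB));
  }

  public static void main(String[] args) {
    double[] a = {1.0, 1.0};
    double[] b = {3.0, 3.0};      // same direction as A, but farther away
    double[] c = {1.5, 0.5};      // close to A, but a different direction
    System.out.println("cosine(A,B) = " + cosine(a, b) + ", cosine(A,C) = " + cosine(a, c));
    System.out.println("euclidean(A,B) = " + euclidean(a, b) + ", euclidean(A,C) = " + euclidean(a, c));
  }
}
```

Here cosine prefers B (1.0 vs. roughly 0.89) while Euclidean distance prefers C (about 0.71 vs. 2.83), which is exactly why the choice of metric changes the neighborhoods.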
Remember! Different similarity metrics will lead to a different grouping of your instances! Think in terms of neighborhoods of instances…
Why do irrelevant features hurt performance? • Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it's easy for the classifier to get confused • Naïve Bayes does not have this problem, but it has other problems, as we have discussed • SVM is relatively good at ignoring irrelevant attributes, but it can still suffer • Also, it's very computationally expensive with large attribute spaces
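The slides stop at the diagnosis, but one common Weka response is to wrap the learner in attribute selection so the tree only ever sees highly ranked attributes. A hedged sketch using information gain with an arbitrary cutoff of 50 attributes (the cutoff and the file name are placeholders, not a recommendation):

```java
import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FilterIrrelevantAttributes {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");     // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    Ranker ranker = new Ranker();
    ranker.setNumToSelect(50);                           // keep only the 50 highest-ranked attributes

    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setEvaluator(new InfoGainAttributeEval());       // rank attributes by information gain
    asc.setSearch(ranker);
    asc.setClassifier(new J48());                        // the tree now sees a smaller attribute space

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(asc, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}
```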
Take Home Message • Good Luck!