
Taming the Learning Zoo



  1. Taming the Learning Zoo

  2. Supervised Learning Zoo • Bayesian learning • Maximum likelihood • Maximum a posteriori • Decision trees • Support vector machines • Neural nets • k-Nearest-Neighbors

  3. Very approximate “cheat-sheet” for techniques discussed in class

  4. What haven’t we covered? • Boosting • Way of turning several “weak learners” into a “strong learner” • E.g. used in popular random forests algorithm • Regression: predicting continuous outputs y=f(x) • Neural nets, nearest neighbors work directly as described • Least squares, locally weighted averaging • Unsupervised learning • Clustering • Density estimation • Dimensionality reduction • [Harder to quantify performance]

  5. Agenda • Quantifying learner performance • Cross validation • Precision & recall • Model selection

  6. Cross-Validation

  7. Assessing Performance of a Learning Algorithm • Additional samples from X are typically unavailable • So: take out some of the training set • Train on the remaining training set • Test on the excluded instances • This is cross-validation

  8. Cross-Validation • Split the original set of examples into a training set and a testing set; train on the training set [figure: labeled examples D (+/−) split into a training set; hypothesis learned from hypothesis space H]

  9. Cross-Validation • Evaluate the hypothesis on the testing set [figure: held-out testing set of labeled examples; hypothesis from hypothesis space H]

  10. Cross-Validation • Evaluate the hypothesis on the testing set [figure: hypothesis predictions on the testing set]

  11. Cross-Validation • Compare the true concept against the predictions: 9/13 correct [figure: true labels vs. predicted labels on the testing set]

  12. Common Splitting Strategies • k-fold cross-validation [figure: dataset partitioned into k folds; each fold serves once as the test set, the rest as training]

  13. Common Splitting Strategies • k-fold cross-validation • Leave-one-out (n-fold cross-validation) [figure: each single example serves once as the test set]

  14. Computational Complexity • k-fold cross-validation requires • k training steps, each on n(k−1)/k datapoints • k testing steps, each on n/k datapoints • (There are efficient ways of computing leave-one-out estimates for some nonparametric techniques, e.g., nearest neighbors) • Average results are reported
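The split-train-test-average procedure of slides 8–14 can be sketched in Python (a minimal illustration; the function and variable names are mine, not from the slides; `learner` is any function mapping a training set to a hypothesis):

```python
import random

def k_fold_cross_validate(learner, examples, k):
    """Split examples into k folds; train on k-1 folds, test on the held-out fold."""
    data = list(examples)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for i in range(k):
        test_set = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learner(train_set)              # one of k training steps, ~n(k-1)/k points each
        wrong = sum(1 for x, y in test_set if h(x) != y)
        errors.append(wrong / len(test_set))
    return sum(errors) / k                  # average results reported
```

Setting k = n recovers leave-one-out cross-validation from slide 13.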

  15. Bootstrapping • Similar technique for estimating the confidence in the model parameters θ • Procedure: • Draw k hypothetical datasets from the original data, either via cross-validation or sampling with replacement • Fit the model to each dataset to compute parameters θ1,…,θk • Return the standard deviation of θ1,…,θk (or a confidence interval) • Can also estimate confidence in a prediction y=f(x)

  16. Simple Example: average of N numbers • Data D={x(1),…,x(N)}, model is a constant θ • Learning: minimize E(θ) = Σi (x(i) − θ)² ⇒ compute the average • Repeat for j=1,…,k: • Randomly sample x(1)′,…,x(N)′ from D (with replacement) • Learn θj = 1/N Σi x(i)′ • Return histogram of θ1,…,θk
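The averaging example above can be sketched directly (an illustrative implementation, not from the slides; it returns the spread of the bootstrap means rather than a histogram):

```python
import random
import statistics

def bootstrap_mean_std(data, k=1000, seed=0):
    """Estimate the spread of the sample mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    thetas = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(n)]  # draw a hypothetical dataset
        thetas.append(sum(sample) / n)                 # theta_j = its average
    return statistics.mean(thetas), statistics.stdev(thetas)
```

The returned standard deviation is the bootstrap's confidence estimate for the learned parameter θ.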

  17. Precision-Recall Curves

  18. Precision vs. Recall • Precision • # of true positives / (# true positives + # false positives) • Recall • # of true positives / (# true positives + # false negatives) • A precise classifier is selective • A classifier with high recall is inclusive
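The two definitions above translate directly into code (a minimal sketch, assuming binary labels with 1 = positive; names are mine):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A selective classifier keeps fp small (high precision); an inclusive one keeps fn small (high recall).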

  19. Precision-Recall Curves • Measure precision vs. recall as the classification boundary is tuned [figure: PR curves with recall on one axis and precision on the other; curves toward the upper right indicate better learning performance]

  20. Precision-Recall Curves • Measure precision vs. recall as the classification boundary is tuned • Which learner is better? [figure: crossing PR curves for Learner A and Learner B]

  21. Area Under Curve • AUC-PR: measure the area under the precision-recall curve [figure: PR curve with shaded area, AUC = 0.68]

  22. AUC metrics • A single number that measures “overall” performance across multiple thresholds • Useful for comparing many learners • “Smears out” PR curve • Note training / testing set dependence
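One simple way to turn sampled (recall, precision) points into the single AUC-PR number described above is the trapezoid rule (an approximation I'm sketching here; the slides don't specify the integration method, and practical toolkits often use interpolated or step-wise variants instead):

```python
def auc_pr(points):
    """Approximate the area under a precision-recall curve by the trapezoid rule.

    `points` is a list of (recall, precision) pairs sampled while tuning
    the classification boundary.
    """
    pts = sorted(points)  # order by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0  # trapezoid between adjacent points
    return area
```

This "smears out" the curve exactly as the slide warns: two learners with different curve shapes can end up with the same area.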

  23. Model Selection and Regularization

  24. Complexity Vs. Goodness of Fit • More complex models can fit the data better, but can overfit • Model selection: enumerate several possible hypothesis classes of increasing complexity, stop when cross-validated error levels off • Regularization: explicitly define a metric of complexity and penalize it in addition to loss

  25. Model Selection with k-fold Cross-Validation • Parameterize the learner by a complexity level C • Model selection pseudocode: • For increasing levels of complexity C: • errT[C], errV[C] = Cross-Validate(Learner, C, examples) [average k-fold CV training and testing errors] • If errT has converged, stop (needed capacity reached) • Find the value Cbest that minimizes errV[C] • Return Learner(Cbest, examples)
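The pseudocode above might look like this in Python (an illustrative sketch; `cross_validate` is a stand-in for the k-fold routine, returning average training and validation error for a given complexity, and the convergence tolerance is my own choice):

```python
def select_model(learner, cross_validate, examples, max_c, tol=1e-3):
    """Sweep complexity levels C; stop once training error converges,
    then retrain at the C with the lowest validation error."""
    err_train, err_val = {}, {}
    prev = float('inf')
    for c in range(1, max_c + 1):
        err_train[c], err_val[c] = cross_validate(learner, c, examples)
        if prev - err_train[c] < tol:   # errT has converged: capacity reached
            break
        prev = err_train[c]
    c_best = min(err_val, key=err_val.get)
    return learner(c_best, examples)    # retrain at Cbest on all examples
```

The decision-tree and feature-selection slides that follow are instances of this loop with C interpreted as max depth and as number of features, respectively.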

  26. Model Selection: Decision Trees • C is the max depth of the decision tree; suppose N attributes • For C = 1,…,N: • errT[C], errV[C] = Cross-Validate(Learner, C, examples) • If errT has converged, stop • Find the value Cbest that minimizes errV[C] • Return Learner(Cbest, examples)

  27. Model Selection: Feature Selection Example • Have many potential features f1,…,fN • The complexity level C indicates the number of features allowed for learning • For C = 1,…,N: • errT[C], errV[C] = Cross-Validate(Learner, examples[f1,…,fC]) • If errT has converged, stop • Find the value Cbest that minimizes errV[C] • Return Learner(Cbest, examples)

  28. Benefits / Drawbacks • Automatically chooses complexity level to perform well on hold-out sets • Expensive: many training / testing iterations • [But wait, if we fit complexity level to the testing set, aren’t we “peeking?”]

  29. Regularization • Let the learner penalize the inclusion of new features vs. accuracy on training set • A feature is included if it improves accuracy significantly, otherwise it is left out • Leads to sparser models • Generalization to test set is considered implicitly • Much faster than cross-validation

  30. Regularization • Minimize: • Cost(h) = Loss(h) + λ·Complexity(h) • Example with linear models y = θᵀx: • L2 error: Loss(θ) = Σi (y(i) − θᵀx(i))² • Lq regularization: Complexity(θ) = Σj |θj|^q • L2 and L1 are the most popular in linear regularization • L2 regularization leads to simple computation of the optimal θ • L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
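The "simple computation of the optimal θ" under L2 can be shown in the one-dimensional case (my own worked example, not from the slides): minimizing Σi (y(i) − θx(i))² + λθ² and setting the derivative to zero gives θ = Σi x(i)y(i) / (Σi x(i)² + λ).

```python
def ridge_1d(xs, ys, lam):
    """Closed-form L2-regularized least squares for a 1-D linear model y = theta*x.

    Minimizing sum((y - theta*x)^2) + lam*theta^2 yields
    theta = sum(x*y) / (sum(x*x) + lam).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

Note how increasing λ shrinks θ toward 0, trading training loss for lower complexity.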

  31. Data Dredging • As the number of attributes increases, so does the likelihood that a learner picks up on patterns that arise purely by chance • In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes can overfit • E.g., linear classifiers • Sparsity is important to enforce • Many opportunities for charlatans in the big data age!

  32. Issues in Practice • The distinctions between learning algorithms diminish when you have a lot of data • The web has made it much easier to gather large scale datasets than in early days of ML • Understanding data with many more attributes than examples is still a major challenge! • Do humans just have really great priors?

  33. Next Lectures • Intelligent agents (R&N Ch 2) • Markov Decision Processes • Reinforcement learning • Applications of AI: computer vision, robotics
