Taming the Learning Zoo
Supervised Learning Zoo • Bayesian learning • Maximum likelihood • Maximum a posteriori • Decision trees • Support vector machines • Neural nets • k-Nearest-Neighbors
Very approximate “cheat-sheet” for the techniques discussed in class
What haven’t we covered? • Boosting • Way of turning several “weak learners” into a “strong learner” • E.g. used in popular random forests algorithm • Regression: predicting continuous outputs y=f(x) • Neural nets, nearest neighbors work directly as described • Least squares, locally weighted averaging • Unsupervised learning • Clustering • Density estimation • Dimensionality reduction • [Harder to quantify performance]
Agenda • Quantifying learner performance • Cross validation • Precision & recall • Model selection
Assessing Performance of a Learning Algorithm • New samples from X (future examples) are typically unavailable • Instead, take out some of the training set • Train on the remaining training set • Test on the excluded instances • Cross-validation
Cross-Validation • Split the original set of examples; train on the training portion [figure: labeled examples D, the training split, and hypothesis space H]
Cross-Validation • Evaluate the hypothesis on the testing set: predict a label for each held-out example [figure: held-out testing set, hypothesis space H, predicted labels]
Cross-Validation • Compare the true concept against the predictions; in the illustrated example, 9/13 test predictions are correct [figure: true vs. predicted labels on the testing set]
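As a concrete illustration of the split-train-evaluate loop above, here is a minimal Python sketch; the synthetic two-blob dataset, the 25% hold-out fraction, and the simple 1-nearest-neighbor predictor are illustrative assumptions, not the example from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic dataset: two Gaussian blobs labeled +1 / -1.
X_pos = rng.normal(loc=[1.0, 1.0], scale=0.7, size=(50, 2))
X_neg = rng.normal(loc=[-1.0, -1.0], scale=0.7, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

# Hold out 25% of the examples as a testing set.
idx = rng.permutation(len(X))
n_test = len(X) // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

def predict_1nn(x, X_train, y_train):
    """Predict the label of the single nearest training example."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

# Compare predictions against the true labels on the held-out set.
y_pred = np.array([predict_1nn(x, X_train, y_train) for x in X_test])
accuracy = np.mean(y_pred == y_test)   # fraction of correct test predictions
print(f"held-out accuracy: {accuracy:.2f}")
```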
Common Splitting Strategies • k-fold cross-validation [figure: dataset divided into k folds; each fold serves once as the Test set while the remaining folds form the Train set]
Common Splitting Strategies • k-fold cross-validation • Leave-one-out (n-fold cross-validation) [figure: each individual example serves once as the Test set while the other n-1 examples form the Train set]
Computational complexity • k-fold cross-validation requires • k training steps on n(k-1)/k datapoints • k testing steps on n/k datapoints • (There are efficient ways of computing L.O.O. estimates for some nonparametric techniques, e.g., Nearest Neighbors) • The k results are averaged and reported
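A minimal Python sketch of k-fold cross-validation with the averaging step; the synthetic dataset and the deliberately simple nearest-centroid learner are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic dataset: two labeled Gaussian blobs.
X = np.vstack([rng.normal( 1.0, 0.8, size=(60, 2)),
               rng.normal(-1.0, 0.8, size=(60, 2))])
y = np.array([+1] * 60 + [-1] * 60)

def nearest_centroid_fit(X, y):
    """A deliberately simple learner: remember each class centroid."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = np.array(list(model.keys()))
    centroids = np.stack([model[c] for c in classes])
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def k_fold_cv(X, y, k=5, seed=0):
    """k training steps on n(k-1)/k points, k testing steps on n/k points."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = nearest_centroid_fit(X[train_idx], y[train_idx])
        y_pred = nearest_centroid_predict(model, X[test_idx])
        scores.append(np.mean(y_pred == y[test_idx]))
    return np.mean(scores)        # average results reported

print(f"5-fold CV accuracy: {k_fold_cv(X, y, k=5):.2f}")
# Leave-one-out is the special case k = n: k_fold_cv(X, y, k=len(X))
```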
Bootstrapping • A similar technique for estimating the confidence in the model parameters • Procedure: • Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement • Fit the model for each dataset to compute parameters θ1,…,θk • Return the standard deviation of θ1,…,θk (or a confidence interval) • Can also estimate confidence in a prediction y = f(x)
Simple Example: average of N numbers • Data D = {x(1),…,x(N)}, the model is a constant θ • Learning: minimize E(θ) = Σi (x(i) − θ)2 ⇒ compute the average • Repeat for j = 1,…,k: • Randomly sample a subset x(1)’,…,x(N)’ from D • Learn θj = 1/N Σi x(i)’ • Return a histogram of θ1,…,θk
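A minimal Python sketch of this bootstrap-the-average example, resampling with replacement and returning the spread of the fitted parameter; the data values and the number of resamples k are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data D = {x(1), ..., x(N)}; the model is a single constant theta.
D = rng.normal(loc=5.0, scale=2.0, size=100)
k = 1000                               # number of bootstrap resamples

thetas = []
for _ in range(k):
    # Randomly resample N points from D with replacement.
    D_resampled = rng.choice(D, size=len(D), replace=True)
    # Minimizing E(theta) = sum_i (x(i) - theta)^2 is just taking the average.
    thetas.append(D_resampled.mean())

thetas = np.array(thetas)
print(f"theta estimate: {D.mean():.3f}")
print(f"bootstrap std of theta: {thetas.std():.3f}")    # confidence in theta
print(f"approx. 95% interval: {np.percentile(thetas, [2.5, 97.5])}")
```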
Precision vs. Recall • Precision • # of true positives / (# true positives + # false positives) • Recall • # of true positives / (# true positives + # false negatives) • A precise classifier is selective • A classifier with high recall is inclusive
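A minimal Python sketch of these two definitions, counting true positives, false positives, and false negatives; the true and predicted labels are made up for illustration.

```python
import numpy as np

# Illustrative ground-truth and predicted labels (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)  # selective: of what we flagged, how much is right?
recall    = tp / (tp + fn)  # inclusive: of what is there, how much did we flag?
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```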
Precision-Recall curves • Measure precision vs. recall as the classification boundary is tuned [figure: PR curve on precision/recall axes; curves toward the upper right indicate better learning performance]
Precision-Recall curves • Measure precision vs. recall as the classification boundary is tuned • Which learner is better? [figure: PR curves for Learner A and Learner B overlaid]
Area Under Curve • AUC-PR: measure the area under the precision-recall curve [figure: PR curve with the area underneath shaded, AUC = 0.68]
AUC metrics • A single number that measures “overall” performance across multiple thresholds • Useful for comparing many learners • “Smears out” PR curve • Note training / testing set dependence
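A minimal Python sketch of tracing a PR curve by sweeping the decision threshold over classifier scores and integrating it to get AUC-PR; the scores are synthetic, and trapezoidal integration over recall is just one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative classifier scores: positives tend to score higher than negatives.
y_true = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(-1.0, 1.0, 50)])

# Sweep the decision threshold over every observed score.
precisions, recalls = [], []
for t in np.sort(scores):
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precisions.append(tp / (tp + fp))
    recalls.append(tp / (tp + fn))

# Integrate precision over recall (trapezoidal rule) to get AUC-PR.
order = np.argsort(recalls)
auc_pr = np.trapz(np.array(precisions)[order], np.array(recalls)[order])
print(f"AUC-PR ≈ {auc_pr:.2f}")
```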
Complexity Vs. Goodness of Fit • More complex models can fit the data better, but can overfit • Model selection: enumerate several possible hypothesis classes of increasing complexity, stop when cross-validated error levels off • Regularization: explicitly define a metric of complexity and penalize it in addition to loss
Model Selection with k-fold Cross-Validation • Parameterize the learner by a complexity level C • Model selection pseudocode: • For increasing levels of complexity C: • errT[C], errV[C] = Cross-Validate(Learner, C, examples)   [average k-fold CV training error and testing error] • If errT has converged, stop (the needed capacity has been reached) • Find the value Cbest that minimizes errV[C] • Return Learner(Cbest, examples)
Model Selection: Decision Trees • C is the max depth of the decision tree; suppose there are N attributes • For C = 1,…,N: • errT[C], errV[C] = Cross-Validate(Learner, C, examples) • If errT has converged, stop • Find the value Cbest that minimizes errV[C] • Return Learner(Cbest, examples)
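A minimal Python sketch of this depth-selection loop, using scikit-learn's DecisionTreeClassifier and cross_val_score as stand-ins for Learner and Cross-Validate; the synthetic dataset, fold count, and convergence tolerance are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Illustrative dataset with N = 10 attributes, only a few of them informative.
N = 10
X = rng.normal(size=(300, N))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

err_v = {}
prev_err_t = None
for C in range(1, N + 1):                      # C = max depth of the tree
    learner = DecisionTreeClassifier(max_depth=C, random_state=0)
    # Validation error from k-fold CV, training error from a full-data fit.
    err_v[C] = 1.0 - cross_val_score(learner, X, y, cv=5).mean()
    err_t = 1.0 - learner.fit(X, y).score(X, y)
    # Stop once training error has (approximately) converged.
    if prev_err_t is not None and abs(prev_err_t - err_t) < 1e-3:
        break
    prev_err_t = err_t

C_best = min(err_v, key=err_v.get)             # depth minimizing errV[C]
best_model = DecisionTreeClassifier(max_depth=C_best, random_state=0).fit(X, y)
print(f"selected max depth: {C_best}")
```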
Model Selection: Feature selection example • Have many potential features f1,…,fN • Complexity level C indicates the number of features allowed for learning • For C = 1,…,N: • errT[C], errV[C] = Cross-Validate(Learner, examples[f1,…,fC]) • If errT has converged, stop • Find the value Cbest that minimizes errV[C] • Return Learner(Cbest, examples)
Benefits / Drawbacks • Automatically chooses complexity level to perform well on hold-out sets • Expensive: many training / testing iterations • [But wait, if we fit complexity level to the testing set, aren’t we “peeking?”]
Regularization • Let the learner penalize the inclusion of new features vs. accuracy on training set • A feature is included if it improves accuracy significantly, otherwise it is left out • Leads to sparser models • Generalization to test set is considered implicitly • Much faster than cross-validation
Regularization • Minimize: • Cost(h) = Loss(h) + Complexity(h) • Example with linear models y = θᵀx: • L2 error: Loss(θ) = Σi (y(i) − θᵀx(i))2 • Lq regularization: Complexity(θ) = Σj |θj|q • L2 and L1 are the most popular in linear regularization • L2 regularization leads to a simple computation of the optimal θ • L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
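A minimal Python sketch of L2-regularized (ridge) linear regression, whose optimum has the closed form θ = (XᵀX + λI)⁻¹Xᵀy; the data and the regularization weight λ scaling the complexity term are illustrative assumptions, and the L1 case would instead require an iterative solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y depends on only the first 2 of 10 features, plus noise.
n, d = 100, 10
X = rng.normal(size=(n, d))
theta_true = np.zeros(d)
theta_true[:2] = [2.0, -3.0]
y = X @ theta_true + 0.1 * rng.normal(size=n)

lam = 1.0   # regularization weight (illustrative)

# L2 (ridge): minimize sum_i (y_i - theta^T x_i)^2 + lam * sum_j theta_j^2,
# which has the closed-form solution theta = (X^T X + lam I)^-1 X^T y.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("ridge coefficients:", np.round(theta_ridge, 2))
# L1 (lasso) has no such closed form, but its solutions set many theta_j
# exactly to 0, giving the sparse models described above.
```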
Data Dredging • As the number of attributes increases, so does the likelihood that a learner picks up on patterns arising purely by chance • In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes can overfit • E.g., linear classifiers • Enforcing sparsity becomes important • Many opportunities for charlatans in the big data age!
Issues in Practice • The distinctions between learning algorithms diminish when you have a lot of data • The web has made it much easier to gather large scale datasets than in early days of ML • Understanding data with many more attributes than examples is still a major challenge! • Do humans just have really great priors?
Next Lectures • Intelligent agents (R&N Ch 2) • Markov Decision Processes • Reinforcement learning • Applications of AI: computer vision, robotics