
Statistical Learning Methods


Presentation Transcript


  1. Statistical Learning Methods Marco Loog

  2. Introduction • Agents can handle uncertainty by using the methods of probability and decision theory • But first they must learn their probabilistic theories of the world from experience...

  3. Key Concepts • Data : evidence, i.e., instantiation of one or more random variables describing the domain • Hypotheses : probabilistic theories of how the domain works

  4. Outline • Bayesian learning • Maximum a posteriori and maximum likelihood learning • Instance-based learning • Neural networks

  5. Bayesian Learning • Let D be all data, with observed value d; then the probability of a hypothesis hi follows from Bayes’ rule : P(hi|d) = αP(d|hi)P(hi) • For prediction about a quantity X : P(X|d) = ∑i P(X|d,hi)P(hi|d) = ∑i P(X|hi)P(hi|d)

  6. Bayesian Learning • For prediction about quantity X : P(X|d) = ∑i P(X|d,hi)P(hi|d) = ∑i P(X|hi)P(hi|d) • No single best-guess hypothesis

  7. Bayesian Learning • Simply calculates the probability of each hypothesis, given the data, and makes predictions based on this • I.e., predictions are based on all hypotheses, weighted by their probabilities, rather than on only a ‘single best’ hypothesis

  8. Candy • Suppose five kinds of bags of candies • 10% are h1 : 100% cherry candies • 20% are h2 : 75% cherry candies + 25% lime candies • 40% are h3 : 50% cherry candies + 50% lime candies • 20% are h4 : 25% cherry candies + 75% lime candies • 10% are h5 : 100% lime candies • We observe candies drawn from some bag

  9. Mo’ Candy • We observe candies drawn from some bag • Assume observations are i.i.d., e.g., because there are many candies in the bag • Assume we don’t like the green lime candies • Important questions • What kind of bag is it? h1, h2, ..., h5? • What flavor will the next candy be?

  10. Posterior Probability of Hypotheses

  11. Posterior Probability of Hypotheses • The true hypothesis will eventually dominate the Bayesian prediction [the prior has no influence in the long run] • More importantly [maybe not for us?] : the Bayesian prediction is optimal
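
A minimal sketch of this Bayesian update for the candy example, using the bag proportions and priors from slide 8; the observed sequence of limes below is an illustrative assumption, not data from the slides.

```python
# Bayesian learning for the candy example: P(h_i|d) = alpha * P(d|h_i) * P(h_i),
# and prediction P(lime|d) = sum_i P(lime|h_i) * P(h_i|d).

priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1)..P(h5) from slide 8
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i) for each bag type

def update(posterior, candy):
    """One Bayesian update step after observing a single candy."""
    likelihoods = [p if candy == "lime" else 1.0 - p for p in p_lime]
    unnorm = [lik * post for lik, post in zip(likelihoods, posterior)]
    z = sum(unnorm)                   # normalization constant: alpha = 1/z
    return [u / z for u in unnorm]

def predict_lime(posterior):
    """Bayesian prediction: average over all hypotheses, weighted by posterior."""
    return sum(p * post for p, post in zip(p_lime, posterior))

posterior = priors[:]
for candy in ["lime"] * 10:           # assumed observation sequence: ten limes
    posterior = update(posterior, candy)
print(posterior)                      # h5 (the all-lime bag) dominates
print(predict_lime(posterior))        # close to 1
```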

  12. The Price for Being Optimal • For real learning problems the hypothesis space is large, possibly infinite • Summation / integration over hypotheses cannot be carried out • Resort to approximate or simplified methods

  13. Maximum A Posteriori • Common approximation method : make predictions based on the single most probable hypothesis • I.e., take the hMAP that maximizes P(hi|d) • Such a MAP hypothesis gives approximately Bayesian predictions, i.e., P(X|d) ≈ P(X|hMAP) [the more evidence, the better the approximation]
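
Continuing the candy sketch above, MAP prediction keeps only the argmax hypothesis instead of averaging; this snippet reuses the `posterior` and `p_lime` names from that sketch (my own names, not from the slides).

```python
def map_predict_lime(posterior):
    """Predict with the single most probable hypothesis: P(lime|d) ≈ P(lime|h_MAP)."""
    i_map = max(range(len(posterior)), key=lambda i: posterior[i])  # argmax_i P(h_i|d)
    return p_lime[i_map]
```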

  14. Hypothesis Prior • Both in Bayesian learning and in MAP learning, the hypothesis prior plays an important role • If the hypothesis space is too expressive, overfitting can occur [see also Chapter 18] • The prior is used to penalize complexity [instead of explicitly limiting the space] : the more complex the hypothesis, the lower the prior probability • If enough evidence is available, a complex hypothesis is eventually chosen [if necessary]

  15. Maximum Likelihood Approximation • With enough data, the prior becomes irrelevant • Maximum likelihood [ML] learning : choose the hi that maximizes P(d|hi) • I.e., simply get the best fit to the data • Identical to MAP for a uniform prior P(hi) • Also reasonable if all hypotheses are of the same complexity • ML is the ‘standard’ [non-Bayesian / ‘classical’] statistical learning method

  16. E.g. • Bag from a new manufacturer; fraction θ of cherry candies; any θ in [0,1] is possible • Suppose we unwrap N candies : c cherries and l = N − c limes • Likelihood : P(d|hθ) = θ^c (1−θ)^l • Maximize for θ using the log likelihood
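
A short worked version of that maximization, filled in here because the slide's formulas were images; it is the standard result in the slide's notation (with ℓ = N − c):

```latex
L(\theta) = \log P(\mathbf{d}\mid h_\theta)
          = c\log\theta + \ell\log(1-\theta),
\qquad
\frac{dL}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0
\;\Longrightarrow\;
\theta_{ML} = \frac{c}{c+\ell} = \frac{c}{N}.
```

So the ML estimate is simply the observed fraction of cherry candies.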

  17. E.g. 2 • Gaussian model [often denoted by N(µ,σ²)] • Log likelihood is given by L(µ,σ) = ∑j [ −log(σ√(2π)) − (xj − µ)² / (2σ²) ] • If σ is known, find the maximum likelihood estimate for µ • If µ is known, find the maximum likelihood estimate for σ
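
Setting the derivatives of this log likelihood to zero gives the familiar closed-form estimates; a brief worked version (standard result, the slide itself only poses the exercise):

```latex
\frac{\partial L}{\partial\mu} = \sum_{j}\frac{x_j-\mu}{\sigma^2} = 0
\;\Longrightarrow\;
\mu_{ML} = \frac{1}{N}\sum_{j} x_j,
\qquad
\frac{\partial L}{\partial\sigma} = \sum_{j}\Bigl(-\frac{1}{\sigma}+\frac{(x_j-\mu)^2}{\sigma^3}\Bigr) = 0
\;\Longrightarrow\;
\sigma_{ML}^2 = \frac{1}{N}\sum_{j}\bigl(x_j-\mu\bigr)^2.
```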

  18. Halfway Summary and Additional Remarks • Full Bayesian learning gives best possible predictions but is intractable • MAP selects single best hypothesis; prior is still used • Maximum likelihood assumes uniform prior, OK for large data sets • Choose parameterized family of models to describe the data • Write down likelihood of data as function of parameters • Write down derivative of log likelihood w.r.t. each parameter • Find parameter values such that the derivatives are zero • ML estimation may be hard / impossible; modern optimization techniques help • In games, data often becomes available sequentially; not necessary to train in one go
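
When the zero-derivative equations have no closed-form solution, the "modern optimization techniques" remark above usually means maximizing the log likelihood numerically. A minimal sketch, assuming SciPy is available and using made-up 1-D data; it fits the Gaussian model from slide 17 by minimizing the negative log likelihood:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([4.9, 5.3, 4.7, 5.1, 5.6, 4.8])   # assumed observations

def neg_log_likelihood(params):
    mu, log_sigma = params                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                   - (x - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)   # ≈ sample mean and (biased) standard deviation
```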

  19. Outline • Bayesian learning √ • Maximum a posteriori and maximum likelihood learning √ • Instance-based learning • Neural networks

  20. Instance-Based Learning • So far we have seen statistical learning as parameter learning, i.e., given a specific parameter-dependent family of probability models, fit it to the data by tweaking the parameters • Often simple and effective • Fixed complexity • May be good when there is very little data

  21. Instance-Based Learning • So far we saw statistical learning as parameter learning • Nonparametric learning methods allow the hypothesis complexity to grow with the data • “The more data we have, the ‘wigglier’ the hypothesis can be”

  22. Nearest-Neighbor Method • Key idea : properties of an input point x are likely to be similar to points in the neighborhood of x • E.g. classification : estimate unknown class of x using classes of neighboring points • Simple, but how does one define what a neighborhood is? • One solution : find the k nearest neighbors • But now the problem is how to decide what nearest is...

  23. k Nearest-Neighbor Classification • Check the class / output labels of your k neighbors and simply take [for example] (# of neighbors having class label x) / k as the posterior probability of class label x • When assigning a single label : take the MAP label!
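
A minimal k-nearest-neighbor sketch along these lines, assuming Euclidean distance as the notion of "nearest" and using made-up 2-D data; the function names are my own:

```python
import numpy as np
from collections import Counter

def knn_posterior(x, X_train, y_train, k=3):
    """Estimate P(label | x) as (# of the k nearest neighbors with that label) / k."""
    dists = np.linalg.norm(X_train - x, axis=1)            # Euclidean distances to x
    nearest = y_train[np.argsort(dists)[:k]]                # labels of the k nearest points
    counts = Counter(nearest)
    return {label: counts[label] / k for label in counts}

def knn_classify(x, X_train, y_train, k=3):
    """Assign the single MAP label."""
    post = knn_posterior(x, X_train, y_train, k)
    return max(post, key=post.get)

# Assumed toy data: two classes in the plane
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.8, 0.9]), X_train, y_train, k=3))   # -> 1
```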

  24. kNN Probability Density Estimation

  25. Kernel Models • Idea : put a little density function [a kernel] at every data point and take the [normalized] sum of these • Somewhat similar to kNN • Often provides comparable performance
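
A minimal 1-D kernel density estimate in this spirit, with Gaussian kernels; the bandwidth h and the data points are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used as the kernel placed on each data point."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h=0.3):
    """Normalized sum (here: mean) of kernels centred on the data points."""
    return np.mean(gaussian_kernel((x - data) / h)) / h

data = np.array([1.1, 1.3, 2.0, 2.2, 2.4])
print(kde(1.2, data))   # relatively high density near the data
print(kde(3.5, data))   # low density far from the data
```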

  26. Probability Density Estimation

  27. Outline • Bayesian learning √ • Maximum a posteriori and maximum likelihood learning √ • Instance-based learning √ • Neural networks

  28.–34. Neural Networks and Games [figure-only slides; the images are not included in the transcript]

  35. So First... Neural Networks • According to Robert Hecht-Nielsen, a neural network is simply “a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs” • Simply... • We skip the biology for now • And provide the bare basics
