Statistical Learning Methods Marco Loog
Introduction • Agents can handle uncertainty by using the methods of probability and decision theory • But first they must learn their probabilistic theories of the world from experience...
Key Concepts • Data : evidence, i.e., instantiation of one or more random variables describing the domain • Hypotheses : probabilistic theories of how the domain works
Outline • Bayesian learning • Maximum a posteriori and maximum likelihood learning • Instance-based learning • Neural networks
Bayesian Learning • Let D be all data, with observed value d, then the probability of a hypothesis hi, using Bayes' rule : P(hi|d) = α P(d|hi) P(hi) • For prediction about quantity X : P(X|d) = ∑i P(X|d,hi) P(hi|d) = ∑i P(X|hi) P(hi|d)
Bayesian Learning • For prediction about quantity X : P(X|d) = ∑i P(X|d,hi) P(hi|d) = ∑i P(X|hi) P(hi|d) • No single best-guess hypothesis
Bayesian Learning • Simply calculate the probability of each hypothesis, given the data, and make predictions based on this • I.e., predictions are based on all hypotheses, weighted by their probabilities, rather than on only a ‘single best’ hypothesis
Candy • Suppose five kinds of bags of candies • 10% are h1 : 100% cherry candies • 20% are h2 : 75% cherry candies + 25% lime candies • 40% are h3 : 50% cherry candies + 50% lime candies • 20% are h4 : 25% cherry candies + 75% lime candies • 10% are h5 : 100% lime candies • We observe candies drawn from some bag
Mo’ Candy • We observe candies drawn from some bag • Assume observations are i.i.d., e.g. because there are many candies in the bag • Assume we don’t like the green lime candy • Important questions • What kind of bag is it? h1, h2, ..., h5? • What flavor will the next candy be?
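To make the candy example concrete, here is a minimal Python sketch of full Bayesian updating; the priors and per-hypothesis lime probabilities come from the slides, while the function and variable names are our own illustrative choices.

```python
# Full Bayesian learning for the candy example: update P(hi | d) after each
# observed candy and predict the flavor of the next one.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]            # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]          # P(lime | hi) for each bag type

def posterior(observations):
    """P(hi | d) for a sequence of observations ('lime' or 'cherry')."""
    post = list(priors)
    for obs in observations:
        likelihoods = [p if obs == 'lime' else 1 - p for p in p_lime]
        post = [lk * p for lk, p in zip(likelihoods, post)]
        norm = sum(post)                       # the normalization constant alpha
        post = [p / norm for p in post]
    return post

def predict_lime(observations):
    """P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)."""
    post = posterior(observations)
    return sum(p * q for p, q in zip(p_lime, post))

if __name__ == '__main__':
    d = ['lime'] * 5                           # we unwrapped five limes in a row
    print(posterior(d))                        # h5 (all lime) dominates
    print(predict_lime(d))                     # close to 1
```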
Posterior Probability of Hypotheses • True hypothesis will eventually dominate the Bayesian prediction [prior is of no influence in the long run] • More importantly [maybe not for us?] : Bayesian prediction is optimal
The Price for Being Optimal • For real learning problems the hypothesis space is large, possibly infinite • Summation / integration over hypotheses cannot be carried out • Resort to approximate or simplified methods
Maximum A Posteriori • Common approximation method : make predictions based on the single most probable hypothesis • I.e. take the hi that maximizes P(hi|d) • Predictions from such a MAP hypothesis are approximately Bayesian, i.e., P(X|d) ≈ P(X|hi) [the more evidence the better the approximation]
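A small sketch of the difference for the candy example, using the same prior and lime probabilities as above; the names are illustrative only.

```python
# MAP prediction for the candy example: pick the single hypothesis with the
# highest posterior and predict using it alone.

p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]          # P(lime | hi) for each bag type

def map_predict_lime(posteriors):
    """P(lime | h_MAP): prediction from the most probable hypothesis only."""
    i_map = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return p_lime[i_map]

# With the posteriors from the full Bayesian sketch above, MAP commits to h5
# after a few limes and predicts P(lime) = 1, whereas the full Bayesian
# prediction is a weighted average slightly below 1.
```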
Hypothesis Prior • Both in Bayesian learning and in MAP learning, hypothesis prior plays an important role • If hypothesis space is too expressive overfitting can occur [see also Chapter 18] • Prior is used to penalize complexity [instead of explicitly limiting the space] : the more complex the hypothesis the lower the prior probability • If enough evidence available, eventually complex hypothesis chosen [if necessary]
Maximum Likelihood Approximation • For enough data, prior becomes irrelevant • Maximum likelihood [ML] learning : choose the hi that maximizes P(d|hi) • I.e., simply get the best fit to the data • Identical to MAP for uniform prior P(hi) • Also reasonable if all hypotheses are of the same complexity • ML is the ‘standard’ [non-Bayesian / ‘classical’] statistical learning method
E.g. • Bag from a new manufacturer; fraction θ of red cherry candies; any θ between 0 and 1 is possible • Suppose we unwrap N candies, c cherries and l = N - c limes • Likelihood : P(d|hθ) = θ^c (1-θ)^l • Maximize for θ using the log likelihood L(θ) = c log θ + l log(1-θ)
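A minimal sketch of this maximum likelihood fit in Python; it uses the closed-form answer θ = c/N and checks it against a brute-force search over the log likelihood (the data values are made up).

```python
import math

# Maximum likelihood estimate of the cherry fraction theta from N unwrapped
# candies of which c were cherries and l were limes.

def log_likelihood(theta, c, l):
    """log P(d | h_theta) = c log(theta) + l log(1 - theta)."""
    return c * math.log(theta) + l * math.log(1 - theta)

def ml_theta(c, l):
    """Closed-form maximizer of the log likelihood: theta = c / N."""
    return c / (c + l)

if __name__ == '__main__':
    c, l = 7, 3                                 # 7 cherries, 3 limes
    print(ml_theta(c, l))                       # 0.7
    # brute-force check that 0.7 indeed maximizes the log likelihood
    grid = [i / 1000 for i in range(1, 1000)]
    print(max(grid, key=lambda t: log_likelihood(t, c, l)))   # ~0.7
```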
E.g. 2 • Gaussian model [often denoted by N(µ,σ²)] • Log likelihood is given by L = ∑j [ -log(σ√(2π)) - (xj - µ)² / (2σ²) ] • If σ is known, find the maximum likelihood estimate for µ • If µ is known, find the maximum likelihood estimate for σ
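A short sketch of the resulting maximum likelihood estimates, the sample mean and the mean squared deviation; the data values are made up for illustration.

```python
import math

# Maximum likelihood estimates for a Gaussian N(mu, sigma^2) from data x.

def ml_gaussian(x):
    """Return (mu_ML, sigma2_ML): sample mean and mean squared deviation."""
    n = len(x)
    mu = sum(x) / n
    sigma2 = sum((xi - mu) ** 2 for xi in x) / n   # note: divides by n, not n - 1
    return mu, sigma2

def log_likelihood(x, mu, sigma2):
    """log P(x | mu, sigma^2), summed over the (i.i.d.) data points."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (xi - mu) ** 2 / (2 * sigma2) for xi in x)

if __name__ == '__main__':
    x = [4.9, 5.1, 5.0, 4.8, 5.2]
    mu, sigma2 = ml_gaussian(x)
    print(mu, sigma2, log_likelihood(x, mu, sigma2))
```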
Halfway Summary and Additional Remarks • Full Bayesian learning gives best possible predictions but is intractable • MAP selects single best hypothesis; prior is still used • Maximum likelihood assumes uniform prior, OK for large data sets • Choose parameterized family of models to describe the data • Write down likelihood of data as function of parameters • Write down derivative of log likelihood w.r.t. each parameter • Find parameter values such that the derivatives are zero • ML estimation may be hard / impossible; modern optimization techniques help • In games, data often becomes available sequentially; not necessary to train in one go
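The last point above, that data often arrives sequentially, can be handled by updating the estimate incrementally instead of refitting from scratch; a minimal sketch for the Gaussian mean (the running-average recurrence is standard, the class name is ours).

```python
# Incremental (online) maximum likelihood estimate of a Gaussian mean:
# each new observation nudges the running estimate, no need to store all data.

class RunningMean:
    def __init__(self):
        self.n = 0
        self.mu = 0.0

    def update(self, x):
        """mu_n = mu_{n-1} + (x - mu_{n-1}) / n : the ML estimate after n points."""
        self.n += 1
        self.mu += (x - self.mu) / self.n
        return self.mu

if __name__ == '__main__':
    rm = RunningMean()
    for x in [4.9, 5.1, 5.0, 4.8, 5.2]:
        print(rm.update(x))                     # converges to the batch mean 5.0
```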
Outline • Bayesian learning √ • Maximum a posteriori and maximum likelihood learning √ • Instance-based learning • Neural networks
Instance-Based Learning • So far we saw statistical learning as parameter learning, i.e., given a specific parameter-dependent family of probability models, fit it to the data by tweaking the parameters • Often simple and effective • Fixed complexity • Maybe good when there is very little data
Instance-Based Learning • So far we saw statistical learning as parameter learning • Nonparametric learning methods allow hypothesis complexity to grow with the data • “The more data we have, the ‘wigglier’ the hypothesis can be”
Nearest-Neighbor Method • Key idea : properties of an input point x are likely to be similar to points in the neighborhood of x • E.g. classification : estimate unknown class of x using classes of neighboring points • Simple, but how does one define what a neighborhood is? • One solution : find the k nearest neighbors • But now the problem is how to decide what nearest is...
k Nearest-Neighbor Classification • Check the class / output label of your k neighbors and simply take [for example] the # of neighbors having class label x divided by k as the posterior probability of having class label x • When assigning a single label : take MAP!
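A minimal sketch of k-nearest-neighbor classification, using Euclidean distance as the notion of ‘nearest’; all names and the toy data are illustrative.

```python
from collections import Counter
import math

# k-nearest-neighbor classification: estimate class probabilities of a query
# point from the labels of its k nearest training points.

def knn_posteriors(query, data, labels, k=3):
    """Return {label: fraction of the k nearest neighbors with that label}."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(range(len(data)), key=lambda i: dist(query, data[i]))[:k]
    counts = Counter(labels[i] for i in nearest)
    return {lab: n / k for lab, n in counts.items()}

def knn_classify(query, data, labels, k=3):
    """Single label: the MAP choice, i.e. the most common label among the neighbors."""
    post = knn_posteriors(query, data, labels, k)
    return max(post, key=post.get)

if __name__ == '__main__':
    data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ['a', 'a', 'a', 'b', 'b', 'b']
    print(knn_posteriors((0.5, 0.5), data, labels))   # {'a': 1.0}
    print(knn_classify((4.5, 5.0), data, labels))     # 'b'
```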
Kernel Models • Idea : Put a little density function [a kernel] on every data point and take the [normalized] sum of these • Somehow similar to kNN • Often provides comparable performance
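A minimal kernel density sketch with a Gaussian kernel; the bandwidth and data values are made up for illustration.

```python
import math

# Kernel density estimation: place a Gaussian kernel on every data point and
# average them to obtain a density estimate whose complexity grows with the data.

def kde(x, data, bandwidth=0.5):
    """Estimated density at x: mean of Gaussian kernels centered on the data."""
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

if __name__ == '__main__':
    data = [4.9, 5.1, 5.0, 4.8, 5.2]
    print(kde(5.0, data))    # high density near the data
    print(kde(8.0, data))    # close to zero far away from it
```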
Outline • Bayesian learning √ • Maximum a posteriori and maximum likelihood learning √ • Instance-based learning √ • Neural networks
So First... Neural Networks • According to Robert Hecht-Nielsen, a neural network is simply “a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs” • Simply... • We skip the biology for now • And provide the bare basics
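As a first taste of those bare basics, a sketch of a single processing element of the kind the quote describes: a weighted sum of inputs passed through an activation function. The weights, bias, and inputs here are arbitrary illustrative values.

```python
import math

# One "simple processing element": output = g(w . x + b) with a sigmoid g.

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

if __name__ == '__main__':
    print(neuron([1.0, 0.0], weights=[2.0, -1.0], bias=-1.0))   # ~0.73
```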