Bayesian learning finalized (with high probability)
Everything’s random... • Basic Bayesian viewpoint: • Treat (almost) everything as a random variable • Data/independent var: X vector • Class/dependent var: Y • Parameters: Θ • E.g., mean, variance, correlations, multinomial params, etc. • Use Bayes’ Rule to assess probabilities of classes • Allows us to say: “It is very unlikely that the mean height is 2 light years”
Uncertainty over params • Maximum likelihood treats parameters as (unknown) constants • Job is just to pick the constants so as to maximize data likelihood • Full-blown Bayesian modeling treats params as random variables • PDF over parameter variables tells us how certain/uncertain we are about the location of that parameter • Also allows us to express prior beliefs (probabilities) about params
Example: Coin flipping • Have a “weighted” coin -- want to figure out θ=Pr[heads] • Maximum likelihood: • Flip coin a bunch of times, measure #heads; #tails • Use estimator to return a single value for θ • This is called a point estimate
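A minimal sketch of the point estimate, assuming flips are recorded as 1 = heads and 0 = tails (the data here is illustrative, not from the slides):

```python
# Maximum-likelihood point estimate of theta = Pr[heads].
flips = [1, 0, 1, 1, 0, 1, 0, 1]      # hypothetical observations (1 = heads)

heads = sum(flips)
tails = len(flips) - heads

theta_mle = heads / (heads + tails)   # a single number, with no uncertainty attached
print(f"MLE point estimate: theta = {theta_mle:.3f}")
```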
Example: Coin flipping • Have a “weighted” coin -- want to figure out θ=Pr[heads] • Bayesian posterior estimation: • Start w/ distribution over what θ might be • Flip coin a bunch of times, measure #heads; #tails • Update distribution, but never reduce to a single number • Always keep around Pr[θ | data]: posterior estimate
Example: Coin flipping [figure sequence: the posterior distribution over θ after 0, 1, 5, 10, 20, 50, and 100 total flips]
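A minimal sketch of the kind of updating those slides visualize, assuming a discretized grid over θ and illustrative flip data (neither is from the original slides):

```python
import numpy as np

# Keep a whole distribution over theta = Pr[heads], discretized onto a grid.
theta = np.linspace(0.0, 1.0, 101)
posterior = np.ones_like(theta) / len(theta)   # start flat: total uncertainty at 0 flips

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=100)         # hypothetical coin with true Pr[heads] = 0.7

for x in flips:
    likelihood = theta if x == 1 else (1.0 - theta)   # Pr[flip | theta]
    posterior *= likelihood                           # Bayes' rule, up to a constant
    posterior /= posterior.sum()                      # renormalize; never collapse to a point

# The posterior mean is one summary; the full Pr[theta | data] is what we keep.
print("posterior mean after", len(flips), "flips:", (theta * posterior).sum())
```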
How does it work? • Think of parameters as just another kind of random variable • Now your data distribution is Pr[X | Θ] • This is the generative distribution • A.k.a. observation distribution, sensor model, etc. • What we want is some model of the parameter as a function of the data: Pr[Θ | X] • Get there with Bayes’ rule: Pr[Θ | X] = Pr[X | Θ] Pr[Θ] / Pr[X]
What does that mean? • Let’s look at the parts: • Generative distribution Pr[X | Θ] • Describes how data is generated by the underlying process • Usually easy to write down (well, easier than the other parts, anyway) • Same old PDF/PMF we’ve been working with • Can be used to “generate” new samples of data that “look like” your training data
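For the coin example, “generating new samples that look like your training data” might look like this sketch (the parameter value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7                                   # a hypothetical value of the parameter
new_flips = rng.binomial(1, theta, size=10)   # draws from the generative distribution Pr[X | theta]
print(new_flips)                              # synthetic flips that "look like" real data
```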
What does that mean? • The parameter prior or a priori distribution: Pr[Θ] • Allows you to say “this value of Θ is more likely than that one is...” • Allows you to express beliefs/assumptions/preferences about the parameters of the system • Also takes over when the data is sparse (small N) • In the limit of large data, the prior should “wash out”, letting the data dominate the estimate of the parameter • Can let Pr[Θ] be “uniform” (a.k.a., “uninformative”) to minimize its impact
What does that mean? • The data prior: Pr[X] • Expresses the probability of seeing data set X independent of any particular model • Huh?
What does that mean? • The data prior: Pr[X] • Expresses the probability of seeing data set X independent of any particular model • Can get it from the joint data/parameter model: Pr[X] = ∫ Pr[X | Θ] Pr[Θ] dΘ • In practice, often don’t need it explicitly (why?)
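A sketch of getting Pr[X] from the joint model by summing over a discretized parameter (the counts and grid here are illustrative):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta) / len(theta)    # uniform prior over the grid

h, t = 6, 2                                 # hypothetical heads/tails counts
likelihood = theta**h * (1.0 - theta)**t    # Pr[data | theta], up to the binomial coefficient

# Pr[X] = integral of Pr[X | theta] Pr[theta] d(theta), approximated by a sum
evidence = np.sum(likelihood * prior)
print("Pr[X] (up to a constant):", evidence)
```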
What does that mean? • Finally, the posterior (or a posteriori) distribution: Pr[Θ | X] • Lit., “from what comes after” (Latin) • Essentially, “what we believe about the parameter after we look at the data” • As compared to the “prior” or “a priori” (lit., “from what is before”) parameter distribution, Pr[Θ]
Example: coin flipping • A (biased) coin lands heads-up w/ prob p and tails-up w/ prob 1-p • Parameter of the system is p • Goal is to find Pr[p | sequence of coin flips] • (Technically, we want a PDF, f(p | flips)) • Q: what family of PDFs is appropriate?
Example: coin flipping • We need a PDF that generates possible values of p • p ∈ [0,1] • Commonly used distribution is the beta distribution: f(p | α, β) = p^(α-1) (1-p)^(β-1) / B(α, β) • The p^(α-1) factor plays the role of Pr[heads], the (1-p)^(β-1) factor the role of Pr[tails], and the normalization constant B(α, β) is the “Beta function”
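A sketch of evaluating the beta prior with scipy (the hyperparameter values are illustrative):

```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0                  # illustrative hyperparameters alpha, beta
p = np.linspace(0.0, 1.0, 101)

density = beta.pdf(p, a, b)      # p^(a-1) (1-p)^(b-1) / B(a, b)
print(density[48:53])            # values near p = 0.5, where this prior peaks
```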
The Beta Distribution [figure] Image courtesy of Wikimedia Commons
Generative distribution • f(p | α, β) is the prior distribution for p • Parameters α and β are hyperparameters • Govern shape of f() • Still need the generative distribution: Pr[h, t | p] • h, t: number of heads, tails • Use a binomial distribution: Pr[h, t | p] = (h+t choose h) p^h (1-p)^t
Posterior • Now, by Bayes’ rule: f(p | h, t) ∝ Pr[h, t | p] f(p | α, β) ∝ p^h (1-p)^t · p^(α-1) (1-p)^(β-1) = p^(h+α-1) (1-p)^(t+β-1) • So the posterior is again a beta distribution: f(p | h, t) = Beta(p | α+h, β+t)
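A sketch of the conjugate update this slide derives: a Beta(α, β) prior plus h heads and t tails gives a Beta(α+h, β+t) posterior (the counts and hyperparameters below are illustrative):

```python
from scipy.stats import beta

alpha_prior, beta_prior = 2.0, 2.0     # prior pseudo-counts (illustrative)
h, t = 7, 3                            # observed heads and tails (illustrative)

# Conjugacy: the posterior is again a beta distribution.
posterior = beta(alpha_prior + h, beta_prior + t)

print("posterior mean of p:", posterior.mean())            # (alpha + h) / (alpha + beta + h + t)
print("95% credible interval:", posterior.interval(0.95))
```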
Exercise • Suppose you want to estimate the average air speed of an unladen (African) swallow • Let’s say that airspeeds of individual swallows, x, are Gaussian distributed with mean μ and variance 1: x ~ N(μ, 1) • Let’s say, also, that we think the mean is “around” 50 kph, but we’re not sure exactly what it is. Express that uncertainty as a Gaussian prior on the mean with variance 10: μ ~ N(50, 10) • Derive the posterior estimate of the mean airspeed.
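One possible numerical check for the exercise, using the standard conjugate-Gaussian update for a Gaussian likelihood with known variance and a Gaussian prior on the mean (the measurements below are illustrative, not part of the exercise):

```python
import numpy as np

# Prior on the mean airspeed: mu ~ N(mu0, sigma0_sq); likelihood: x_i ~ N(mu, 1).
mu0, sigma0_sq = 50.0, 10.0
sigma_sq = 1.0

x = np.array([48.2, 51.0, 49.5, 50.8])   # hypothetical measured airspeeds (kph)
n = len(x)

# Standard conjugate-Gaussian result (known observation variance):
post_var = 1.0 / (1.0 / sigma0_sq + n / sigma_sq)
post_mean = post_var * (mu0 / sigma0_sq + x.sum() / sigma_sq)

print(f"posterior: mu | x ~ N({post_mean:.2f}, {post_var:.3f})")
```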