
CS b351 Statistical Learning



Presentation Transcript


  1. CS b351 Statistical Learning

  2. Agenda • Learning coin flips, learning Bayes net parameters • Likelihood functions, maximum likelihood estimation (MLE) • Priors, maximum a posteriori estimation (MAP) • Bayesian estimation

  3. Learning Coin Flips • Observe that c out of N draws are cherries (data) • Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag • “Intuitive” parameter estimate: the empirical distribution P(cherry) ≈ c / N (Why is this reasonable? Perhaps we got a bad draw!)

  4. Learning Coin Flips • Observe that c out of N draws are cherries (data) • Let the unknown fraction of cherries be θ (hypothesis) • Probability of drawing a cherry is θ • Assumption: draws are independent and identically distributed (i.i.d.)

  5. Learning Coin Flips • Probability of drawing a cherry is θ • Assumption: draws are independent and identically distributed (i.i.d.) • Probability of drawing 2 cherries is θ·θ • Probability of drawing 2 limes is (1 − θ)^2 • Probability of drawing 1 cherry and 1 lime: θ·(1 − θ)

  6. Likelihood Function • Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ • P(d | θ) = ∏_j P(d_j | θ) (by the i.i.d. assumption)

  7. Likelihood Function • Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ • P(d | θ) = ∏_j P(d_j | θ), where P(d_j | θ) = θ if d_j = cherry and 1 − θ if d_j = lime (the probability model, assuming θ is given)

  8. Likelihood Function • Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ • P(d | θ) = ∏_j P(d_j | θ) = θ^c (1 − θ)^(N−c), where P(d_j | θ) = θ if d_j = cherry and 1 − θ if d_j = lime (gather the c cherry terms together, then the N − c lime terms)
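
To make the formula concrete, here is a minimal Python sketch (the function name and example data are mine, not from the slides) that evaluates the likelihood θ^c (1 − θ)^(N−c) for a sequence of draws:

```python
def likelihood(theta, draws):
    """Probability of an i.i.d. sequence of draws given theta = P(cherry).
    draws is a list of strings, each either "cherry" or "lime"."""
    c = sum(1 for d in draws if d == "cherry")   # number of cherries
    n = len(draws)
    return theta**c * (1 - theta)**(n - c)       # theta^c (1 - theta)^(N - c)

# Example: 2 cherries and 1 lime
print(likelihood(0.5, ["cherry", "cherry", "lime"]))   # 0.125
print(likelihood(2/3, ["cherry", "cherry", "lime"]))   # ~0.148 (higher; 2/3 turns out to be the MLE)
```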

  9. Maximum Likelihood • Likelihood of data d = {d_1, …, d_N} given θ • P(d | θ) = θ^c (1 − θ)^(N−c)

  10.–15. Maximum Likelihood • (These slides repeat the text of slide 9, each accompanied by a plot of the likelihood function P(d | θ) = θ^c (1 − θ)^(N−c) for an example data set.)

  16. Maximum Likelihood • The peaks of the likelihood function seem to hover around the observed fraction of cherries… • Their sharpness indicates some notion of certainty…

  17. Maximum Likelihood • P(d | θ) is the likelihood function • The quantity argmax_θ P(d | θ) is known as the maximum likelihood estimate (MLE) • Here θ = 1 is the MLE

  18. Maximum Likelihood • P(d | θ) is the likelihood function • argmax_θ P(d | θ) is the MLE • θ = 1 is the MLE

  19. Maximum Likelihood • P(d | θ) is the likelihood function • argmax_θ P(d | θ) is the MLE • θ = 2/3 is the MLE

  20. Maximum Likelihood • P(d | θ) is the likelihood function • argmax_θ P(d | θ) is the MLE • θ = 1/2 is the MLE

  21. Maximum Likelihood • P(d | θ) is the likelihood function • argmax_θ P(d | θ) is the MLE • θ = 2/5 is the MLE

  22. Proof: Empirical Frequency is the MLE • l(θ) = log P(d | θ) = log[θ^c (1 − θ)^(N−c)]

  23. Proof: Empirical Frequency is the MLE • l(θ) = log P(d | θ) = log[θ^c (1 − θ)^(N−c)] = log[θ^c] + log[(1 − θ)^(N−c)]

  24. Proof: Empirical Frequency is the MLE • l(θ) = log P(d | θ) = log[θ^c (1 − θ)^(N−c)] = log[θ^c] + log[(1 − θ)^(N−c)] = c log θ + (N − c) log(1 − θ)

  25. Proof: Empirical Frequency is the MLE • l(θ) = log P(d | θ) = c log θ + (N − c) log(1 − θ) • Setting dl/dθ (θ) = 0 gives the maximum likelihood estimate

  26. Proof: Empirical Frequency is the MLE • dl/dθ (θ) = c/θ − (N − c)/(1 − θ) • At the MLE, c/θ − (N − c)/(1 − θ) = 0 ⇒ θ = c/N
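
As a sanity check on the derivation (not part of the slides), a short Python sketch comparing the closed-form MLE c/N against a brute-force search over θ:

```python
# Numerically verify that theta = c/N maximizes L(theta) = theta^c (1 - theta)^(N - c).
c, N = 3, 10                                   # e.g. 3 cherries out of 10 draws
thetas = [i / 1000 for i in range(1, 1000)]    # grid over (0, 1)
best = max(thetas, key=lambda t: t**c * (1 - t)**(N - c))
print(best, c / N)                             # both ~0.3
```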

  27. Maximum Likelihood for BN • For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values • Example (network: Earthquake → Alarm ← Burglar), N = 1000 samples: B: 200, E: 500, so P(B) = 0.2 and P(E) = 0.5; Alarm counts: A|E,B: 19/20, A|B: 188/200, A|E: 170/500, A|neither: 1/380

  28. Fitting CPTs via MLE • M examples D = (d[1], …, d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN

  29. Fitting CPTs via MLE • M examples D = (d[1], …, d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • Suppose the BN has a single variable X • Estimate X's CPT, P(X)

  30. Fitting CPTs via MLE • M examples D = (d[1], …, d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • Suppose the BN has a single variable X • Estimate X's CPT, P(X) • (Just learning a coin flip as usual) • P_MLE(X) = empirical distribution of D • P_MLE(X=T) = Count(X=T) / M • P_MLE(X=F) = Count(X=F) / M

  31. Fitting CPTs via MLE • M examples D = (d[1], …, d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • Suppose the BN is X → Y • Estimate P(X), P(Y|X) • Estimate P_MLE(X) as usual

  32. Fitting CPTs via MLE • M examples D = (d[1], …, d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • Estimate P_MLE(Y|X) with the conditional empirical distribution: P_MLE(Y=y | X=x) = Count(X=x, Y=y) / Count(X=x)

  33. Fitting CPTs via MLE • (Text identical to slide 32.)

  34. Fitting CPTs via MLE • M examples D = (d[1], …, d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • In general, for P(Y | X1, …, Xk) (e.g., a BN with X1, X2, X3 → Y): • For each setting of (y, x1, …, xk): • Compute Count(y, x1, …, xk) • Compute Count(x1, …, xk) • Set P_MLE(y | x1, …, xk) = Count(y, x1, …, xk) / Count(x1, …, xk) (see the counting sketch below)
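
A minimal sketch of this counting procedure in plain Python (the function name and the toy data are mine), assuming each example is a dict mapping variable names to values:

```python
from collections import Counter

def fit_cpt_mle(examples, child, parents):
    """MLE of P(child | parents) from complete examples.
    Returns a dict: ((parent values), child value) -> probability."""
    joint = Counter()    # Count(y, x1, ..., xk)
    parent = Counter()   # Count(x1, ..., xk)
    for e in examples:
        xs = tuple(e[p] for p in parents)
        joint[(xs, e[child])] += 1
        parent[xs] += 1
    return {(xs, y): n / parent[xs] for (xs, y), n in joint.items()}

# Hypothetical data for a network with X1, X2, X3 -> Y
data = [{"X1": 0, "X2": 1, "X3": 0, "Y": 1},
        {"X1": 0, "X2": 1, "X3": 0, "Y": 0},
        {"X1": 1, "X2": 0, "X3": 0, "Y": 1}]
print(fit_cpt_mle(data, "Y", ["X1", "X2", "X3"]))
# {((0, 1, 0), 1): 0.5, ((0, 1, 0), 0): 0.5, ((1, 0, 0), 1): 1.0}
```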

  35. Other MLE results • Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE • Make a histogram, divide by N • Continuous Gaussian distributions: • Mean = average of the data • Standard deviation = standard deviation of the data • (Figures: a histogram and a Gaussian (normal) distribution)
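
Both bullets are one-liners in code. A quick sketch (my own example data), assuming NumPy is available for the Gaussian case:

```python
import numpy as np
from collections import Counter

# Categorical MLE: histogram divided by N
draws = ["a", "b", "a", "c", "a"]
p_mle = {v: n / len(draws) for v, n in Counter(draws).items()}
print(p_mle)                      # {'a': 0.6, 'b': 0.2, 'c': 0.2}

# Gaussian MLE: sample mean and standard deviation
data = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
print(data.mean(), data.std())    # np.std defaults to the 1/N (MLE) form
```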

  36. Nice properties of MLE • Easy to compute (for certain probability models) • With enough data, the θ_MLE estimate will approach the true unknown value of θ

  37. Problems with MLE • The MLE was easy to compute… but what happens when we don’t have much data? • Motivation • You hand me a coin from your pocket • 1 flip, turns up tails • What’s the MLE?

  38. Problems with MLE • The MLE was easy to compute… but what happens when we don’t have much data? • Motivation • You hand me a coin from your pocket • 1 flip, turns up tails • What’s the MLE? • θ_MLE has a high variance with small sample sizes

  39. Variance of an Estimator: Intuition • The dataset D is just a sample of the underlying distribution, and if we could “do over” the sample, then we might get a new dataset D’. • With D’, our MLE estimate θ_MLE’ might be different • How much? How often? • Assume all values of θ are equally likely • In the case of 1 draw, D would just as likely have been a lime. In that case, θ_MLE = 0 • So with probability 0.5, θ_MLE would be 1, and with the same probability, θ_MLE would be 0. • High variance: typical “do overs” give drastically different results!
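
The “do over” intuition is easy to simulate (my own sketch, assuming for concreteness that the true θ is 0.5): resample many tiny data sets and watch how much θ_MLE jumps around.

```python
import random

random.seed(0)
true_theta, N, trials = 0.5, 1, 10000     # N = 1 draw per "do over"
estimates = []
for _ in range(trials):
    draws = [random.random() < true_theta for _ in range(N)]  # True = cherry
    estimates.append(sum(draws) / N)                          # theta_MLE = c/N
print(sum(estimates) / trials)    # ~0.5 on average ...
print(estimates[:10])             # ... but each individual estimate is 0.0 or 1.0
# With N = 100 instead, the estimates cluster tightly around 0.5 (low variance).
```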

  40. Is there a Better Way? Bayesian Learning

  41. An Alternative Approach: Bayesian Estimation • P(D | θ) is the likelihood • P(θ) is the hypothesis prior • P(θ | D) = 1/Z P(D | θ) P(θ) is the posterior • Distribution of hypotheses given the data • (Graphical model: θ is the parent of d[1], d[2], …, d[M])

  42. Bayesian Prediction • For a new draw Y: use the hypothesis posterior to predict P(Y | D) • (Graphical model: θ is the parent of d[1], d[2], …, d[M] and of Y)
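
To see slides 41–42 in action for the cherry fraction θ, here is a sketch that discretizes θ onto a grid with a uniform prior (the grid and the prior are my own choices, not from the slides), forms the posterior, and averages the per-hypothesis predictions:

```python
# Bayesian estimation and prediction for theta = P(cherry) on a discretized grid.
K = 101
thetas = [i / (K - 1) for i in range(K)]    # hypotheses theta = 0, 0.01, ..., 1
prior = [1 / K] * K                         # uniform hypothesis prior P(theta)

c, N = 1, 2                                 # data: 1 cherry in 2 draws
unnorm = [p * t**c * (1 - t)**(N - c) for p, t in zip(prior, thetas)]
Z = sum(unnorm)
posterior = [u / Z for u in unnorm]         # P(theta | D) = (1/Z) P(D | theta) P(theta)

# Bayesian prediction: P(next draw is cherry | D) = sum_theta theta * P(theta | D)
print(sum(t * p for t, p in zip(thetas, posterior)))   # ~0.5 here, matching the MLE;
# with 1 cherry in 1 draw it gives ~0.67 rather than the MLE's extreme 1.0
```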

  43. Candy Example • Candy comes in 2 flavors, cherry and lime, with identical wrappers • The manufacturer makes 5 indistinguishable bag types: h1 = 100% cherry / 0% lime, h2 = 75% C / 25% L, h3 = 50% C / 50% L, h4 = 25% C / 75% L, h5 = 0% C / 100% L • Suppose we draw some candies • What bag are we holding? What flavor will we draw next?

  44. Bayesian Learning • Main idea: compute the probability of each hypothesis, given the data • Data D: the observed sequence of draws • Hypotheses: h1, …, h5 (the five bag types above)

  45. Bayesian Learning • Main idea: compute the probability of each hypothesis, given the data • Data D: the observed sequence of draws • Hypotheses: h1, …, h5 • We want P(hi | D)… but all we have is P(D | hi)!

  46. Using Bayes’ Rule • P(hi | D) = α P(D | hi) P(hi) is the posterior • (Recall, 1/α = P(D) = Σ_i P(D | hi) P(hi)) • P(D | hi) is the likelihood • P(hi) is the hypothesis prior

  47. Computing the Posterior • Assume draws are independent • Let (P(h1), …, P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1) • D = {10 limes in a row} • Likelihoods: P(D|h1) = 0, P(D|h2) = 0.25^10, P(D|h3) = 0.5^10, P(D|h4) = 0.75^10, P(D|h5) = 1^10 • Products P(D|hi)P(hi): 0, ≈1.9e-7, ≈4e-4, ≈0.011, 0.1 • Sum = 1/α ≈ 0.1114 • Posterior: P(h1|D) = 0, P(h2|D) ≈ 0.00, P(h3|D) ≈ 0.00, P(h4|D) ≈ 0.10, P(h5|D) ≈ 0.90
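
A short sketch reproducing slide 47's numbers (small differences from the slide come from rounding):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | hi)
n_limes = 10                             # data: 10 limes in a row

unnorm = [p * q**n_limes for p, q in zip(priors, p_lime)]   # P(D|hi) P(hi)
Z = sum(unnorm)                                             # 1/alpha, ~0.112
posterior = [u / Z for u in unnorm]
print([round(p, 3) for p in posterior])  # [0.0, 0.0, 0.003, 0.101, 0.896]
```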

  48. Posterior Hypotheses

  49. Predicting the Next Draw • P(Y | D) = Σ_i P(Y | hi, D) P(hi | D) = Σ_i P(Y | hi) P(hi | D) • Probability that the next candy drawn is a lime: • P(hi | D) ≈ (0, 0.00, 0.00, 0.10, 0.90) • P(Y | hi) = (0, 0.25, 0.5, 0.75, 1) • P(Y | D) ≈ 0.975
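
The same prediction in code (the posterior values are carried over, rounded, from the previous sketch; using them gives ≈0.973, while the slide's 0.975 comes from the more coarsely rounded 0.10 and 0.90):

```python
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]          # P(Y = lime | hi)
posterior = [0.0, 0.0, 0.003, 0.101, 0.896]   # P(hi | D), from the previous sketch
print(sum(q * p for q, p in zip(p_lime, posterior)))   # ~0.973
```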

  50. P(Next Candy is Lime | d)
