CS B351: Statistical Learning
Agenda • Learning coin flips, learning Bayes net parameters • Likelihood functions, maximum likelihood estimation (MLE) • Priors, maximum a posteriori estimation (MAP) • Bayesian estimation
Learning Coin Flips • Observe that c out of N draws are cherries (data) • Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag • “Intuitive” parameter estimate: empirical distribution P(cherry) = c/N (Why is this reasonable? Perhaps we got a bad draw!)
Learning Coin Flips • Observe that c out of N draws are cherries (data) • Let the unknown fraction of cherries be q (hypothesis) • Probability of drawing a cherry is q • Assumption: draws are independent and identically distributed (i.i.d.)
Learning Coin Flips • Probability of drawing a cherry is q • Assumption: draws are independent and identically distributed (i.i.d.) • Probability of drawing 2 cherries is q*q = q^2 • Probability of drawing 2 limes is (1-q)^2 • Probability of drawing 1 cherry and 1 lime: q*(1-q)
Likelihood Function • Likelihood: the probability of the data d = {d1,…,dN} given the hypothesis q • P(d|q) = ∏j P(dj|q) (i.i.d. assumption)
Likelihood Function • Likelihood: the probability of the data d = {d1,…,dN} given the hypothesis q • P(d|q) = ∏j P(dj|q), where P(dj|q) = q if dj = Cherry and 1-q if dj = Lime (probability model, assuming q is given)
Likelihood Function • Likelihood: the probability of the data d = {d1,…,dN} given the hypothesis q • P(d|q) = ∏j P(dj|q) = q^c (1-q)^(N-c) (gather the c cherry terms together, then the N-c lime terms)
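To make the product form concrete, here is a minimal Python sketch (not from the slides; the draw labels and the example dataset are illustrative) that multiplies the per-draw probabilities:

```python
# A minimal sketch: evaluating P(d | q) = q^c * (1-q)^(N-c) for a
# hypothetical dataset of candy draws.

def likelihood(q, data):
    """P(d | q) under the i.i.d. model: product of per-draw probabilities."""
    p = 1.0
    for d in data:
        p *= q if d == "cherry" else (1.0 - q)
    return p

draws = ["cherry", "lime", "cherry"]   # c = 2 cherries out of N = 3 draws
print(likelihood(0.5, draws))          # 0.5^2 * 0.5 = 0.125
print(likelihood(2 / 3, draws))        # higher: the likelihood peaks near c/N = 2/3
```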
Maximum Likelihood • Likelihood of data d = {d1,…,dN} given q • P(d|q) = q^c (1-q)^(N-c) • [Plots of the likelihood function P(d|q) for several example datasets]
Maximum Likelihood • Peaks of likelihood function seem to hover around the fraction of cherries… • Sharpness indicates some notion of certainty…
Maximum Likelihood • P(d|q) is the likelihood function • The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE) • [Likelihood plots for several example datasets, with MLEs q = 1, q = 2/3, q = 1/2, and q = 2/5]
Proof: Empirical Frequency is the MLE • l(q) = log P(d|q) = log[q^c (1-q)^(N-c)] = log[q^c] + log[(1-q)^(N-c)] = c log q + (N-c) log(1-q) • Setting dl/dq(q) = 0 gives the maximum likelihood estimate
Proof: Empirical Frequency is the MLE • dl/dq(q) = c/q - (N-c)/(1-q) • At the MLE, c/q - (N-c)/(1-q) = 0, so c(1-q) = (N-c)q, i.e., c = Nq, which gives q = c/N
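As a sanity check on the derivation, the following sketch (illustrative numbers, not from the slides) compares the closed-form MLE c/N with a brute-force grid search over the log-likelihood:

```python
import math

# The closed-form MLE q = c/N should agree with a brute-force search
# over the log-likelihood l(q) = c log q + (N - c) log(1 - q).

def log_likelihood(q, c, N):
    """l(q) for 0 < q < 1."""
    return c * math.log(q) + (N - c) * math.log(1 - q)

c, N = 2, 5                                   # 2 cherries in 5 draws
grid = [i / 1000 for i in range(1, 1000)]     # interior grid points only
q_hat = max(grid, key=lambda q: log_likelihood(q, c, N))
print(c / N, q_hat)                           # 0.4 and 0.4
```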
Maximum Likelihood for BN • For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values • Example (Earthquake and Burglar are parents of Alarm): N = 1000 examples; B occurs in 200 (P(B) = 0.2) and E in 500 (P(E) = 0.5) • Alarm counts by parent configuration: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|¬E,¬B: 1/380
Fitting CPTs via MLE • M examples D = (d[1],…,d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • Suppose the BN has a single variable X • Estimate X’s CPT, P(X) (just learning a coin flip as usual) • P_MLE(X) = empirical distribution of D • P_MLE(X=T) = Count(X=T) / M • P_MLE(X=F) = Count(X=F) / M
Fitting CPTs via MLE • M examples D = (d[1],…,d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • Suppose the BN is X → Y: estimate P(X) and P(Y|X) • Estimate P_MLE(X) as usual • Estimate P_MLE(Y|X) with P_MLE(Y=y | X=x) = Count(X=x, Y=y) / Count(X=x)
Fitting CPTs via MLE • M examples D = (d[1],…,d[M]) • Each d[i] is a complete example of all variables in the Bayes net • Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN • In general, for P(Y|X1,…,Xk): for each setting of (y, x1,…,xk), compute Count(y, x1,…,xk) and Count(x1,…,xk), and set P_MLE(y | x1,…,xk) = Count(y, x1,…,xk) / Count(x1,…,xk)
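A minimal sketch of this counting recipe, assuming complete data stored as dictionaries (the variable names and example records below are hypothetical, not from the slides):

```python
from collections import Counter

# Hypothetical complete dataset: each example assigns a value to every variable.
data = [
    {"X": True,  "Y": True},
    {"X": True,  "Y": False},
    {"X": True,  "Y": True},
    {"X": False, "Y": False},
]

def mle_cpt(data, child, parents):
    """P_MLE(child | parents) = Count(child, parents) / Count(parents)."""
    joint = Counter((d[child], tuple(d[p] for p in parents)) for d in data)
    parent_counts = Counter(tuple(d[p] for p in parents) for d in data)
    return {(y, pa): joint[(y, pa)] / parent_counts[pa] for (y, pa) in joint}

print(mle_cpt(data, "Y", ["X"]))
# e.g. P(Y=True | X=True) = 2/3, P(Y=False | X=False) = 1.0
```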
Other MLE results • Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE (make a histogram, divide by N) • Continuous Gaussian distributions: mean = average of the data, standard deviation = standard deviation of the data
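For the Gaussian case, a small sketch with made-up data (not from the slides); note that the ML standard deviation uses the 1/N normalizer:

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # illustrative observations
N = len(data)

mu_mle = sum(data) / N                                          # ML mean
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in data) / N)  # ML std (1/N)
print(mu_mle, sigma_mle)
```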
Nice properties of MLE • Easy to compute (for certain probability models) • With enough data, the q_MLE estimate will approach the true unknown value of q
Problems with MLE • The MLE was easy to compute… but what happens when we don’t have much data? • Motivation: you hand me a coin from your pocket; 1 flip, turns up tails. What’s the MLE? • q_MLE has high variance with small sample sizes
Variance of an Estimator: Intuition • The dataset D is just a sample of the underlying distribution; if we could “do over” the sample, we might get a new dataset D’. • With D’, our MLE estimate q_MLE’ might be different. How much? How often? • Assume all values of q are equally likely. In the case of 1 draw, D could just as likely have been a Lime; in that case, q_MLE = 0. • So with probability 0.5, q_MLE would be 1, and with the same probability, q_MLE would be 0. • High variance: typical “do overs” give drastically different results!
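The “do over” intuition can be simulated. The sketch below (illustrative, with an assumed true q = 0.5) redraws datasets of several sizes and measures how much the MLE c/N jumps around:

```python
import random

random.seed(0)
true_q = 0.5   # assumed true fraction, just for the simulation

for N in (1, 5, 100):
    estimates = []
    for _ in range(10000):
        c = sum(random.random() < true_q for _ in range(N))  # cherries in N draws
        estimates.append(c / N)                              # MLE for this "do over"
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(N, round(var, 4))   # variance shrinks as N grows, roughly q(1-q)/N
```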
An Alternative Approach: Bayesian Estimation • P(D|q) is the likelihood • P(q) is the hypothesis prior • P(q|D) = 1/Z P(D|q) P(q) is the posterior: the distribution of hypotheses given the data • [Graphical model: q → d[1], d[2], …, d[M]]
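One simple way to picture this posterior is to discretize q onto a grid. The sketch below is an assumed discretization with a uniform prior (not part of the slides): it normalizes likelihood × prior over the grid values.

```python
# Grid approximation of P(q | D) ∝ P(D | q) P(q), with a uniform prior on q.
qs = [i / 100 for i in range(1, 100)]          # grid of hypotheses for q
prior = [1 / len(qs)] * len(qs)                # uniform hypothesis prior

c, N = 2, 5                                    # 2 cherries in 5 draws (illustrative)
likelihood = [q ** c * (1 - q) ** (N - c) for q in qs]

unnormalized = [l * p for l, p in zip(likelihood, prior)]
Z = sum(unnormalized)
posterior = [u / Z for u in unnormalized]

# With a uniform prior the posterior mode (MAP) coincides with the MLE: 0.4 here.
print(qs[max(range(len(qs)), key=lambda i: posterior[i])])
```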
Bayesian Prediction • For a new draw Y: use the hypothesis posterior to predict P(Y|D) • [Graphical model: q → d[1],…,d[M] and q → Y]
Candy Example • Candy comes in 2 flavors, cherry and lime, with identical wrappers • The manufacturer makes 5 indistinguishable bags: h1 (100% cherry, 0% lime), h2 (75% C, 25% L), h3 (50% C, 50% L), h4 (25% C, 75% L), h5 (0% C, 100% L) • Suppose we draw a handful of candies from our bag • What bag are we holding? What flavor will we draw next?
Bayesian Learning • Main idea: compute the probability of each hypothesis, given the data • Data D: the observed sequence of draws • Hypotheses: h1,…,h5 (the five bags above) • We want P(hi|D)… but all we have is P(D|hi)!
Using Bayes’ Rule • P(hi|D) = α P(D|hi) P(hi) is the posterior • (Recall, 1/α = P(D) = Σi P(D|hi) P(hi)) • P(D|hi) is the likelihood • P(hi) is the hypothesis prior
Computing the Posterior • Assume draws are independent • Let (P(h1),…,P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1) • D = {10 limes} • Likelihoods: P(D|h1) = 0, P(D|h2) = 0.25^10, P(D|h3) = 0.5^10, P(D|h4) = 0.75^10, P(D|h5) = 1^10 • Weighted by the prior: P(D|h1)P(h1) = 0, P(D|h2)P(h2) ≈ 2e-7, P(D|h3)P(h3) ≈ 4e-4, P(D|h4)P(h4) ≈ 0.011, P(D|h5)P(h5) = 0.1 • Sum = 1/α ≈ 0.1114 • Posteriors: P(h1|D) = 0, P(h2|D) ≈ 0.00, P(h3|D) ≈ 0.00, P(h4|D) ≈ 0.10, P(h5|D) ≈ 0.90
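The same computation in a few lines of Python, using the slide’s priors and the 10-lime dataset:

```python
# Posterior over the five bag hypotheses after observing 10 limes.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]        # P(lime | h_i)

likelihoods = [p ** 10 for p in p_lime]     # P(D | h_i) for 10 i.i.d. limes
unnormalized = [l * pr for l, pr in zip(likelihoods, priors)]
Z = sum(unnormalized)                       # 1/alpha, about 0.11
posterior = [u / Z for u in unnormalized]
print([round(p, 2) for p in posterior])     # ≈ [0.0, 0.0, 0.0, 0.1, 0.9]
```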
Predicting the Next Draw • P(Y|D) = Σi P(Y|hi, D) P(hi|D) = Σi P(Y|hi) P(hi|D) • [Graphical model: H → D, H → Y] • Probability that the next candy drawn is a lime: P(hi|D) ≈ (0, 0.00, 0.00, 0.10, 0.90), P(Y|hi) = (0, 0.25, 0.5, 0.75, 1), so P(Y|D) = 0.975
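And the corresponding prediction, reusing the rounded posterior from the previous slide so the snippet stands alone and the numbers match the slide:

```python
# Average the per-hypothesis predictions, weighted by the posterior.
posterior = [0.0, 0.0, 0.0, 0.10, 0.90]   # P(h_i | D), rounded as on the slide
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(Y = lime | h_i)

p_next_lime = sum(ph * py for ph, py in zip(posterior, p_lime))
print(round(p_next_lime, 3))              # 0.975
```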