2D1431 Machine Learning Bayesian Learning
Outline
• Bayes theorem
• Maximum likelihood (ML) hypothesis
• Maximum a posteriori (MAP) hypothesis
• Naïve Bayes classifier
• Bayes optimal classifier
• Bayesian belief networks
• Expectation maximization (EM) algorithm
Literature & Software
• T. Mitchell: chapter 6
• S. Russell & P. Norvig, "Artificial Intelligence: A Modern Approach": chapters 14+15
• R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", 2nd ed.: chapters 2+3
• David Heckerman, "A Tutorial on Learning with Bayesian Belief Networks": http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
• Bayes Net Toolbox for Matlab (free), Kevin Murphy: http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html
Bayes Theorem
$P(h \mid D) = \dfrac{P(D \mid h)\,P(h)}{P(D)}$
• $P(D)$: prior probability of the data $D$ (the evidence)
• $P(h)$: prior probability of the hypothesis $h$ (the prior)
• $P(h \mid D)$: posterior probability of the hypothesis $h$ given the data $D$ (the posterior)
• $P(D \mid h)$: probability of the data $D$ given the hypothesis $h$ (the likelihood of the data)
Bayes Theorem
$P(h \mid D) = P(D \mid h)\,P(h) / P(D)$, that is, posterior = likelihood × prior / evidence
• By observing the data $D$ we convert the prior probability $P(h)$ into the a posteriori probability (posterior) $P(h \mid D)$.
• The posterior is the probability that $h$ holds after the data $D$ has been observed.
• The evidence $P(D)$ does not depend on $h$; it can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.
Choosing Hypotheses
$P(h \mid D) = P(D \mid h)\,P(h) / P(D)$
• Generally we want the most probable hypothesis given the training data: the maximum a posteriori hypothesis $h_{MAP}$.
• $h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h) / P(D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$, because $P(D)$ does not depend on $h$.
• If all hypotheses are equally probable a priori, $P(h_i) = P(h_j)$ for all $i, j$, one can choose the maximum likelihood (ML) hypothesis $h_{ML} = \arg\max_{h \in H} P(D \mid h)$.
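Bayes Theorem Example
A patient takes a lab test and the result is positive. The test returns a correct positive ($\oplus$) result in 98% of the cases in which the disease is actually present, and a correct negative ($\ominus$) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease.
• Hypotheses: disease, ¬disease
• Priors $P(h)$: $P(\text{disease}) = 0.008$, $P(\neg\text{disease}) = 0.992$
• Likelihoods $P(D \mid h)$: $P(\oplus \mid \text{disease}) = 0.98$, $P(\ominus \mid \text{disease}) = 0.02$, $P(\oplus \mid \neg\text{disease}) = 0.03$, $P(\ominus \mid \neg\text{disease}) = 0.97$
• Unnormalized posteriors: $P(\text{disease} \mid \oplus) \propto P(\oplus \mid \text{disease})\,P(\text{disease}) = 0.0078$ and $P(\neg\text{disease} \mid \oplus) \propto P(\oplus \mid \neg\text{disease})\,P(\neg\text{disease}) = 0.0298$
• Normalized posteriors: $P(\text{disease} \mid \oplus) = 0.0078 / (0.0078 + 0.0298) = 0.21$ and $P(\neg\text{disease} \mid \oplus) = 0.0298 / (0.0078 + 0.0298) = 0.79$
• Maximum a posteriori hypothesis: $h_{MAP} = \arg\max_h P(h \mid \oplus) = \neg\text{disease}$, even though the test is positive.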
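The arithmetic of the lab-test example can be checked with a few lines of Python. This is only a minimal sketch: the function name `posterior` and the dictionary keys are illustrative choices, not part of the slides.

```python
def posterior(priors, likelihoods):
    """Return P(h|D) for every hypothesis h using Bayes theorem.

    priors:      dict mapping hypothesis -> P(h)
    likelihoods: dict mapping hypothesis -> P(D|h) for the observed data D
    """
    # Unnormalized posteriors P(D|h) P(h)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Evidence P(D), obtained by summing over all hypotheses
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Numbers from the lab-test example (keys are hypothetical labels)
priors = {"disease": 0.008, "no_disease": 0.992}
likelihood_positive = {"disease": 0.98, "no_disease": 0.03}

post = posterior(priors, likelihood_positive)
h_map = max(post, key=post.get)  # maximum a posteriori hypothesis
print(post, h_map)
```

With unrounded intermediate values the posterior of the disease comes out slightly below 0.21, which agrees with the rounded figure on the slide.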
Basic Formulas for Probabilities
• Product rule: $P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
• Sum rule: $P(A \vee B) = P(A) + P(B) - P(A \wedge B)$
• Theorem of total probability: if $A_1, A_2, \dots, A_n$ are mutually exclusive events with $\sum_i P(A_i) = 1$, then $P(B) = \sum_i P(B \mid A_i)\,P(A_i)$
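As a worked instance of the theorem of total probability, the evidence term of the lab-test example (a positive result, written $\oplus$ here) is the sum over the two mutually exclusive hypotheses:

```latex
\begin{aligned}
P(\oplus) &= P(\oplus \mid \text{disease})\,P(\text{disease})
           + P(\oplus \mid \neg\text{disease})\,P(\neg\text{disease}) \\
          &= 0.98 \cdot 0.008 + 0.03 \cdot 0.992
           = 0.00784 + 0.02976 = 0.0376
\end{aligned}
```

This is the denominator that turns the unnormalized values 0.0078 and 0.0298 into the posteriors 0.21 and 0.79.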
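Bayes Theorem Example
• Two-dimensional Gaussian with mean $(\mu_1, \mu_2)$ and one common standard deviation $\sigma$: $P(x_1, x_2 \mid \mu_1, \mu_2, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\sum_{i=1}^{2} \frac{(x_i - \mu_i)^2}{2\sigma^2}\right)$
• Hypothesis $h = \{\mu_1, \mu_2, \sigma\}$, data $D = \{x^1, \dots, x^M\}$, where $x^m_i$ denotes component $i$ of sample $m$
Gaussian Probability Function
• Likelihood of the data: $P(D \mid \mu_1, \mu_2, \sigma) = \prod_{m=1}^{M} P(x^m \mid \mu_1, \mu_2, \sigma)$
• Maximum likelihood hypothesis: $h_{ML} = \arg\max_{\mu_1, \mu_2, \sigma} P(D \mid \mu_1, \mu_2, \sigma)$
• Trick: maximize the log-likelihood $\log P(D \mid \mu_1, \mu_2, \sigma) = \sum_m \log P(x^m \mid \mu_1, \mu_2, \sigma) = \sum_m \log\left[\frac{1}{2\pi\sigma^2} \exp\left(-\sum_i \frac{(x^m_i - \mu_i)^2}{2\sigma^2}\right)\right] = -M \log(2\pi\sigma^2) - \sum_m \sum_i \frac{(x^m_i - \mu_i)^2}{2\sigma^2}$
Gaussian Probability Function
• $\partial \log P(D \mid \mu_1, \mu_2, \sigma) / \partial \mu_i = 0 \;\Rightarrow\; \sum_m (x^m_i - \mu_i) = 0 \;\Rightarrow\; \mu_i^{ML} = \frac{1}{M} \sum_m x^m_i$ (the sample mean of component $i$)
• $\partial \log P(D \mid \mu_1, \mu_2, \sigma) / \partial \sigma = 0 \;\Rightarrow\; -\frac{2M}{\sigma} + \frac{1}{\sigma^3} \sum_m \sum_i (x^m_i - \mu_i)^2 = 0 \;\Rightarrow\; \sigma^2_{ML} = \frac{1}{2M} \sum_m \sum_i (x^m_i - \mu_i^{ML})^2$ (half the mean squared distance of the samples from the mean)
• Maximum likelihood hypothesis: $h_{ML} = \{\mu_i^{ML}, \sigma_{ML}\}$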
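The closed-form estimates can be tried out on synthetic data. The sketch below assumes the two-dimensional isotropic Gaussian with normalization $1/(2\pi\sigma^2)$, for which the variance estimate is the summed squared deviation divided by $2M$. The function name, the true parameters and the sample size are illustrative assumptions, not values from the slides.

```python
import random


def fit_isotropic_gaussian(points):
    """ML estimates (mu, sigma^2) for a Gaussian with one shared variance.

    points: list of equal-length tuples (two-dimensional in the slides).
    """
    M = len(points)
    dim = len(points[0])
    # mu_i = sample mean of component i
    mu = [sum(p[i] for p in points) / M for i in range(dim)]
    # sigma^2 = summed squared deviations / (dim * M); dim = 2 gives 1/(2M)
    squared = sum((p[i] - mu[i]) ** 2 for p in points for i in range(dim))
    return mu, squared / (dim * M)


# Synthetic data drawn from hypothetical "true" parameters
rng = random.Random(0)
true_mu, true_sigma = (0.0, 0.0), 1.5
data = [(rng.gauss(true_mu[0], true_sigma), rng.gauss(true_mu[1], true_sigma))
        for _ in range(5000)]

mu_ml, sigma2_ml = fit_isotropic_gaussian(data)
print(mu_ml, sigma2_ml)
```

With enough samples the estimates should lie close to the generating mean and to $\sigma^2 = 2.25$.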
Maximum Likelihood Hypothesis
• $\mu_{ML} = (0.20, -0.14)$, $\sigma_{ML} = 1.42$
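Bayes Decision Rule
[Figure: scatter plot of two classes; x = examples of class $c_1$, o = examples of class $c_2$, with the fitted Gaussians $\{\mu_1, \sigma_1\}$ and $\{\mu_2, \sigma_2\}$.]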
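Bayes Decision Rule
• Assume we have two Gaussian distributions associated with two separate classes $c_1$, $c_2$.
• $P(x \mid c_i) = P(x \mid \mu_i, \sigma_i) = \frac{1}{2\pi\sigma_i^2} \exp\left(-\frac{\lVert x - \mu_i \rVert^2}{2\sigma_i^2}\right)$ for two-dimensional $x$, where $\mu_i$ is the mean vector of class $c_i$
• Bayes decision rule (maximum posterior probability): decide $c_1$ if $P(c_1 \mid x) > P(c_2 \mid x)$, otherwise decide $c_2$.
• If $P(c_1) = P(c_2)$, compare the likelihoods $P(x \mid c_i)$ (maximum likelihood).
• Otherwise compare the posteriors $P(c_i \mid x) \propto P(x \mid c_i)\,P(c_i)$ (maximum posterior).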
Bayes Decision Rule
[Figure: Bayes decision rule for the two classes $c_2$ and $c_1$.]
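Two-Category Case
• Discriminant function $g(x)$: decide $c_1$ if $g(x) > 0$, otherwise $c_2$.
• $g(x) = P(c_1 \mid x) - P(c_2 \mid x) = \left[P(x \mid c_1)\,P(c_1) - P(x \mid c_2)\,P(c_2)\right] / P(x)$
• Equivalent log form: $g(x) = \log P(c_1 \mid x) - \log P(c_2 \mid x) = \log \frac{P(x \mid c_1)}{P(x \mid c_2)} + \log \frac{P(c_1)}{P(c_2)}$
• Gaussian probability functions with identical $\sigma_i = \sigma$: $g(x) = \frac{\lVert x - \mu_2 \rVert^2}{2\sigma^2} - \frac{\lVert x - \mu_1 \rVert^2}{2\sigma^2} + \log P(c_1) - \log P(c_2)$
• The quadratic terms in $x$ cancel, so the decision surface $g(x) = 0$ is a line (in general a hyperplane).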
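A small sketch of the Gaussian discriminant with a shared $\sigma$ follows. It uses the log form with the prior term added, which is the version consistent with the last formula of the slide; the class means, $\sigma$ and priors in the usage lines are hypothetical.

```python
import math


def discriminant(x, mu1, mu2, sigma, prior1, prior2):
    """g(x) for two isotropic Gaussian classes sharing sigma; g(x) > 0 means c1."""
    d1 = sum((xi - m) ** 2 for xi, m in zip(x, mu1))  # squared distance to mu1
    d2 = sum((xi - m) ** 2 for xi, m in zip(x, mu2))  # squared distance to mu2
    return (d2 - d1) / (2 * sigma ** 2) + math.log(prior1) - math.log(prior2)


def classify(x, mu1, mu2, sigma, prior1=0.5, prior2=0.5):
    """Bayes decision rule: c1 if g(x) > 0, otherwise c2."""
    return "c1" if discriminant(x, mu1, mu2, sigma, prior1, prior2) > 0 else "c2"


# Hypothetical class means and a shared sigma
mu1, mu2, sigma = (0.0, 0.0), (2.0, 0.0), 1.0
print(classify((0.5, 0.0), mu1, mu2, sigma))            # equal priors
print(classify((0.5, 0.0), mu1, mu2, sigma, 0.1, 0.9))  # strong prior for c2
```

With equal priors the boundary is the perpendicular bisector of the two means; unequal priors shift it toward the less probable class.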
Learning a Real-Valued Function
[Figure: target function $f$, maximum likelihood hypothesis $h_{ML}$ and noise $e$.]
• Consider a real-valued target function $f$.
• Noisy training examples $\langle x_i, d_i \rangle$ with $d_i = f(x_i) + e_i$, where $e_i$ is a random variable drawn independently from a Gaussian distribution with zero mean.
• The maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors: $h_{ML} = \arg\min_{h \in H} \sum_i (d_i - h(x_i))^2$
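Learning a Real-Valued Function
• $h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_i p(d_i \mid h) = \arg\max_{h \in H} \prod_i (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right)$
• Maximizing the logarithm $\log p(D \mid h)$ instead: $h_{ML} = \arg\max_{h \in H} \sum_i \left[-\tfrac{1}{2} \log(2\pi\sigma^2) - \frac{(d_i - h(x_i))^2}{2\sigma^2}\right] = \arg\max_{h \in H} \sum_i -(d_i - h(x_i))^2 = \arg\min_{h \in H} \sum_i (d_i - h(x_i))^2$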
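Learning to Predict Probabilities
• Example: predicting the survival probability of a patient
• Training examples $\langle x_i, d_i \rangle$ where $d_i$ is 0 or 1
• Objective: train a neural network to output the probability $h(x_i) = P(d_i = 1 \mid x_i)$
• Maximum likelihood hypothesis: $h_{ML} = \arg\max_{h \in H} \sum_i \left[d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))\right]$. The negative of this sum is the cross entropy between $d_i$ and $h(x_i)$, so maximizing the likelihood minimizes the cross entropy.
• Weight update rule for the weights $w_k$ into a sigmoid output neuron $h(x_i)$, with learning rate $\eta$ and $x_{ik}$ the $k$-th input for example $i$: $w_k \leftarrow w_k + \eta \sum_i (d_i - h(x_i))\,x_{ik}$
• Compare to the standard backpropagation (squared error) weight update rule: $w_k \leftarrow w_k + \eta \sum_i h(x_i)(1 - h(x_i))(d_i - h(x_i))\,x_{ik}$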
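The likelihood-based update rule can be sketched for a single sigmoid unit. This is a minimal illustration, assuming a learning rate (called `eta` here), a bias input fixed to 1 and a tiny made-up data set; none of these values come from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, xs, ds):
    """Sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) for a sigmoid unit."""
    total = 0.0
    for x, d in zip(xs, ds):
        h = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        total += d * math.log(h) + (1 - d) * math.log(1 - h)
    return total


def train(xs, ds, eta=0.05, epochs=500):
    """Batch gradient ascent: w_k <- w_k + eta * sum_i (d_i - h(x_i)) x_ik."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, d in zip(xs, ds):
            h = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
            for k, xk in enumerate(x):
                grad[k] += (d - h) * xk
        w = [wk + eta * gk for wk, gk in zip(w, grad)]
    return w


# Hypothetical data: first input is a constant bias, second a single feature.
# The labels overlap, so the maximum likelihood weights stay finite.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0), (1.0, 5.0)]
ds = [0, 0, 1, 0, 1, 1]

w = train(xs, ds)
print(w, log_likelihood(w, xs, ds))
```

The only difference from the squared-error rule is the missing factor $h(x_i)(1 - h(x_i))$ in the gradient.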
Most Probable Classification
• So far we have sought the most probable hypothesis $h_{MAP}$.
• What is the most probable classification of a new instance $x$ given the data $D$? $h_{MAP}(x)$ is not necessarily the most probable classification, although it is often a sufficiently good approximation of it.
• Consider three possible hypotheses: $P(h_1 \mid D) = 0.4$, $P(h_2 \mid D) = 0.3$, $P(h_3 \mid D) = 0.3$
• Given a new instance $x$: $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$, so $h_{MAP}(x) = h_1(x) = +$
• Most probable classification: $P(+) = P(h_1 \mid D) = 0.4$ and $P(-) = P(h_2 \mid D) + P(h_3 \mid D) = 0.6$, so the most probable classification is $-$.
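Bayes Optimal Classifier
• $c_{max} = \arg\max_{c_j \in C} \sum_{h_i \in H} P(c_j \mid h_i)\,P(h_i \mid D)$
• Example: $P(h_1 \mid D) = 0.4$, $P(h_2 \mid D) = 0.3$, $P(h_3 \mid D) = 0.3$
• $P(+ \mid h_1) = 1$, $P(- \mid h_1) = 0$; $P(+ \mid h_2) = 0$, $P(- \mid h_2) = 1$; $P(+ \mid h_3) = 0$, $P(- \mid h_3) = 1$
• Therefore $\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = 0.4$ and $\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = 0.6$
• $\arg\max_{c_j \in C} \sum_{h_i \in H} P(c_j \mid h_i)\,P(h_i \mid D) = -$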
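The three-hypothesis example can be reproduced directly. The function and variable names below are illustrative; the numbers are the ones from the slide.

```python
def bayes_optimal(posteriors, predictions, classes):
    """Return the class maximizing sum_h P(c|h) P(h|D), plus all class scores.

    posteriors:  dict hypothesis -> P(h|D)
    predictions: dict hypothesis -> dict class -> P(c|h)
    """
    scores = {c: sum(predictions[h][c] * posteriors[h] for h in posteriors)
              for c in classes}
    return max(scores, key=scores.get), scores


posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

label, scores = bayes_optimal(posteriors, predictions, ["+", "-"])

# For comparison: the classification made by the single MAP hypothesis
h_map = max(posteriors, key=posteriors.get)
map_label = max(predictions[h_map], key=predictions[h_map].get)
print(label, scores, map_label)
```

The Bayes optimal label differs from the label of the MAP hypothesis, which is exactly the point of the example.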
MAP vs. Bayes Method
• The maximum a posteriori approach estimates a single point $h_{MAP}$ in the hypothesis space $H$.
• The Bayes method instead estimates and uses the complete distribution $P(h \mid D)$.
• The difference appears when the MAP or the Bayes method is used for inference about unseen instances $x$ and one compares the predictive distributions $P(x \mid D)$.
• MAP: $P(x \mid D) = P(x \mid h_{MAP})$ with $h_{MAP} = \arg\max_{h \in H} P(h \mid D)$
• Bayes: $P(x \mid D) = \sum_{h_i \in H} P(x \mid h_i)\,P(h_i \mid D)$
• For reasonable prior distributions $P(h)$, the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data $D$.
Naïve Bayes Classifier
• Popular, simple learning algorithm
• Use when a moderate or large training set is available
• Assumption: the attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even if the assumption is violated)
• Applications: diagnosis; text classification (newsgroup articles: 20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%)
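Naïve Bayes Classifier
• Assume a discrete target function $f: X \to C$, where each instance $x$ is described by attributes $\langle a_1, a_2, \dots, a_n \rangle$.
• The most probable value of $f(x)$ is $c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid \langle a_1, \dots, a_n \rangle) = \arg\max_{c_j \in C} \frac{P(\langle a_1, \dots, a_n \rangle \mid c_j)\,P(c_j)}{P(\langle a_1, \dots, a_n \rangle)} = \arg\max_{c_j \in C} P(\langle a_1, \dots, a_n \rangle \mid c_j)\,P(c_j)$
• Naïve Bayes assumption: $P(\langle a_1, \dots, a_n \rangle \mid c_j) = \prod_i P(a_i \mid c_j)$
• Naïve Bayes classifier: $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(a_i \mid c_j)$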
Naïve Bayes Learning Algorithm
Naïve_Bayes_Learn(examples)
  for each target value $c_j$
    estimate $P(c_j)$
    for each attribute value $a_i$ of each attribute $a$
      estimate $P(a_i \mid c_j)$
Classify_New_Instance(x)
  $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{a_i \in x} P(a_i \mid c_j)$
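Naïve Bayes Example
• Consider PlayTennis and the new instance ⟨Outlook=sunny, Temp=cool, Humidity=high, Wind=strong⟩.
• Compute $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{a_i \in x} P(a_i \mid c_j)$
• PlayTennis (9+, 5-): $P(\text{yes}) = 9/14$, $P(\text{no}) = 5/14$
• Wind=strong (3+, 3-): $P(\text{strong} \mid \text{yes}) = 3/9$, $P(\text{strong} \mid \text{no}) = 3/5$
• …
• $P(\text{yes})\,P(\text{sunny} \mid \text{yes})\,P(\text{cool} \mid \text{yes})\,P(\text{high} \mid \text{yes})\,P(\text{strong} \mid \text{yes}) = 0.005$
• $P(\text{no})\,P(\text{sunny} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{high} \mid \text{no})\,P(\text{strong} \mid \text{no}) = 0.021$
• Since $0.021 > 0.005$, $c_{NB} = \text{no}$.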
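The learning and classification steps can be sketched on the PlayTennis data. The slide shows only the class and Wind counts, so the 14 rows below are an assumption: they follow the PlayTennis table in Mitchell's textbook as commonly reproduced, and they match the counts given on the slide. The function names mirror the pseudocode but are otherwise illustrative.

```python
from collections import Counter, defaultdict

# Assumed PlayTennis table: (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]


def naive_bayes_learn(examples):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[-1] for row in examples)
    cond_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in examples:
        for i, value in enumerate(row[:-1]):
            cond_counts[(i, row[-1])][value] += 1
    return class_counts, cond_counts


def classify_new_instance(x, class_counts, cond_counts):
    """Return argmax_c P(c) * prod_i P(a_i|c) using relative frequencies."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for i, value in enumerate(x):
            score *= cond_counts[(i, c)][value] / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores


class_counts, cond_counts = naive_bayes_learn(DATA)
label, scores = classify_new_instance(
    ("sunny", "cool", "high", "strong"), class_counts, cond_counts)
print(label, scores)
```

The scores are unnormalized; dividing each by their sum would give the posterior class probabilities under the naïve Bayes assumption.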
Estimating Probabilities
• What if none ($n_c = 0$) of the training instances with target value $c_j$ have attribute value $a_i$? Then $P(a_i \mid c_j) = n_c / n = 0$ and $P(c_j) \prod_{a_i \in x} P(a_i \mid c_j) = 0$.
• Solution: Bayesian estimate (m-estimate) for $P(a_i \mid c_j)$: $P(a_i \mid c_j) = \frac{n_c + m\,p}{n + m}$
• $n$: number of training examples for which $c = c_j$
• $n_c$: number of examples for which $c = c_j$ and $a = a_i$
• $p$: prior estimate of $P(a_i \mid c_j)$
• $m$: weight given to the prior (number of "virtual" examples)
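The estimate is a one-line function. The example values (an attribute with three possible values, a uniform prior and $m = 3$) are hypothetical.

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate (n_c + m*p) / (n + m) of P(a_i|c_j)."""
    return (n_c + m * p) / (n + m)


# An attribute value never seen with class c_j among n = 5 examples.
# Assumed: three possible attribute values, uniform prior p = 1/3, m = 3.
smoothed = m_estimate(n_c=0, n=5, p=1 / 3, m=3)

# With m = 0 the estimate reduces to the plain relative frequency n_c / n.
unsmoothed = m_estimate(n_c=3, n=9, p=1 / 3, m=0)
print(smoothed, unsmoothed)
```

Choosing $p$ uniform and $m$ equal to the number of attribute values corresponds to adding one virtual example per value.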
Bayesian Belief Networks
• The naïve assumption of conditional independence is often too restrictive.
• The full joint probability distribution is intractable due to lack of data.
• Bayesian belief networks describe conditional independence among subsets of variables.
• This allows combining prior knowledge about causal relationships among variables with observed data.
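Conditional Independence
• Definition: $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$ given the value of $Z$, that is, if $\forall x_i, y_j, z_k:\ P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$, or more compactly $P(X \mid Y, Z) = P(X \mid Z)$.
• Example: Thunder is conditionally independent of Rain given Lightning: $P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})$
• Notice: $P(\text{Thunder} \mid \text{Rain}) \neq P(\text{Thunder})$, so Thunder and Rain are not independent without conditioning on Lightning.
• Naïve Bayes uses conditional independence to justify $P(X, Y \mid Z) = P(X \mid Y, Z)\,P(Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$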
Bayesian Belief Network
[Figure: directed acyclic graph with the nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire.]
• The network represents a set of conditional independence assertions.
• Each node is conditionally independent of its non-descendants, given its immediate predecessors (its parents in the directed acyclic graph).
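Bayesian Belief Network
[Figure: the same network with the conditional probability table $P(C \mid S, B)$ shown for Campfire.]
• The network represents the joint probability distribution over all variables, here $P(\text{Storm}, \text{BusTourGroup}, \text{Lightning}, \text{Campfire}, \text{Thunder}, \text{Forestfire})$.
• $P(y_1, \dots, y_n) = \prod_{i=1}^{n} P(y_i \mid \text{Parents}(Y_i))$
• The joint distribution is fully defined by the graph plus the conditional probabilities $P(y_i \mid \text{Parents}(Y_i))$.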
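The factorization can be illustrated on the three-node fragment Storm, BusTourGroup, Campfire, reading $P(C \mid S, B)$ as the table for Campfire given its parents Storm and BusTourGroup. All probability values below are hypothetical placeholders, since the slide text does not give the table entries.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration
P_STORM = 0.2                      # P(Storm = true)
P_BUS = 0.1                        # P(BusTourGroup = true)
P_CAMPFIRE = {                     # P(Campfire = true | Storm, BusTourGroup)
    (True, True): 0.4,
    (True, False): 0.1,
    (False, True): 0.8,
    (False, False): 0.2,
}


def bernoulli(p_true, value):
    """Probability that a binary variable with P(true) = p_true equals value."""
    return p_true if value else 1.0 - p_true


def joint(storm, bus, campfire):
    """P(S, B, C) = P(S) P(B) P(C | S, B): product over nodes given parents."""
    return (bernoulli(P_STORM, storm)
            * bernoulli(P_BUS, bus)
            * bernoulli(P_CAMPFIRE[(storm, bus)], campfire))


# The joint sums to one, and marginals follow by summing out variables.
total = sum(joint(s, b, c) for s, b, c in product([True, False], repeat=3))
p_campfire = sum(joint(s, b, True) for s, b in product([True, False], repeat=2))
print(total, p_campfire)
```

The fragment needs 1 + 1 + 4 = 6 parameters, whereas an unrestricted joint table over three binary variables needs 7; the saving grows quickly with the number of variables.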
Expectation Maximization (EM)
When to use:
• data is only partially observable
• unsupervised clustering: target value unobservable
• supervised learning: some instance attributes unobservable
Applications:
• training Bayesian belief networks
• unsupervised clustering
• learning hidden Markov models
Generating Data from a Mixture of Gaussians
Each instance $x$ is generated by
• choosing one of the $k$ Gaussians at random
• generating an instance according to that Gaussian
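The two-step generative process can be sketched as follows. The sketch assumes one-dimensional Gaussians with a shared standard deviation and a uniform choice among the components; the means, sample size and seed are hypothetical.

```python
import random


def sample_mixture(means, sigma, n, seed=0):
    """Draw n instances from an equally weighted mixture of 1-D Gaussians.

    Returns (x, j) pairs, where j is the index of the generating Gaussian.
    In the learning problem only x would be observed; j is the hidden part.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        j = rng.randrange(len(means))   # step 1: choose a Gaussian uniformly
        x = rng.gauss(means[j], sigma)  # step 2: generate x from that Gaussian
        samples.append((x, j))
    return samples


samples = sample_mixture(means=[-2.0, 3.0], sigma=1.0, n=2000)
fraction_first = sum(1 for _, j in samples if j == 0) / len(samples)
print(samples[:3], fraction_first)
```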
EM for Estimating k Means
Given:
• instances from $X$ generated by a mixture of $k$ Gaussians
• unknown means $\langle \mu_1, \dots, \mu_k \rangle$ of the $k$ Gaussians
• it is not known which instance $x_i$ was generated by which Gaussian
Determine:
• maximum likelihood estimates of $\langle \mu_1, \dots, \mu_k \rangle$
Think of the full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2} \rangle$ (written here for $k = 2$):
• $z_{ij}$ is 1 if $x_i$ was generated by the $j$-th Gaussian and 0 otherwise
• $x_i$ is observable
• $z_{ij}$ is unobservable
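EM for Estimating k Means
EM algorithm: pick a random initial hypothesis $h = \langle \mu_1, \mu_2 \rangle$, then iterate:
• E step: calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds: $E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{\exp\left(-(x_i - \mu_j)^2 / 2\sigma^2\right)}{\sum_{n=1}^{2} \exp\left(-(x_i - \mu_n)^2 / 2\sigma^2\right)}$
• M step: calculate a new maximum likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated in the E step: $\mu_j' = \frac{\sum_{i=1}^{m} E[z_{ij}]\,x_i}{\sum_{i=1}^{m} E[z_{ij}]}$. Then replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu_1', \mu_2' \rangle$.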
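The two steps can be sketched for one-dimensional data. The sketch assumes that the standard deviation is known and shared by both Gaussians, normalizes the E step over both current means, and uses a fixed rather than random starting point so that the run is reproducible. The data, true means and starting values are hypothetical.

```python
import math
import random


def em_two_means(xs, mu_init, sigma, iterations=50):
    """EM estimates of the means of a Gaussian mixture with known shared sigma."""
    mu = list(mu_init)
    for _ in range(iterations):
        # E step: E[z_ij] is the normalized likelihood of x_i under mean j
        expected = []
        for x in xs:
            weights = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            norm = sum(weights)
            expected.append([w / norm for w in weights])
        # M step: each new mean is the E[z_ij]-weighted average of the data
        mu = [sum(e[j] * x for e, x in zip(expected, xs))
              / sum(e[j] for e in expected)
              for j in range(len(mu))]
    return mu


# Hypothetical data: 250 points from each of two Gaussians with sigma = 1
rng = random.Random(1)
xs = ([rng.gauss(-2.0, 1.0) for _ in range(250)]
      + [rng.gauss(3.0, 1.0) for _ in range(250)])

mu_hat = em_two_means(xs, mu_init=(-1.0, 1.0), sigma=1.0)
print(mu_hat)
```

Because EM only finds a local maximum, a different starting point can swap the order of the two estimates or, for poorly separated data, end in a worse solution.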
EM Algorithm
• Converges to a local maximum of the likelihood and provides estimates of the hidden variables $z_{ij}$.
• More precisely, it converges to a local maximum of $E[\ln P(Y \mid h)]$, where
• $Y$ is the complete data (observable plus unobservable variables)
• the expected value is taken over the possible values of the unobserved variables in $Y$
General EM Problem
Given:
• observed data $X = \{x_1, \dots, x_m\}$
• unobserved data $Z = \{z_1, \dots, z_m\}$
• parameterized probability distribution $P(Y \mid h)$, where $Y = \{y_1, \dots, y_m\}$ is the full data with $y_i = \langle x_i, z_i \rangle$ and $h$ are the parameters
Determine:
• $h$ that (locally) maximizes $E[\ln P(Y \mid h)]$
Applications:
• training Bayesian belief networks
• unsupervised clustering
• hidden Markov models
General EM Method
• Define a function $Q(h' \mid h)$ that gives the expected log-likelihood of the full data $Y = X \cup Z$ under $h'$, using the observed $X$ and the current parameters $h$ to estimate $Z$: $Q(h' \mid h) = E[\ln P(Y \mid h') \mid h, X]$
EM algorithm:
• Estimation (E) step: calculate $Q(h' \mid h)$ using the current hypothesis $h$ and the observed data $X$ to estimate the probability distribution over $Y$: $Q(h' \mid h) = E[\ln P(Y \mid h') \mid h, X]$
• Maximization (M) step: replace the hypothesis $h$ by the hypothesis $h'$ that maximizes this $Q$ function: $h \leftarrow \arg\max_{h' \in H} Q(h' \mid h)$
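To connect the general method with the k-means case, the $Q$ function can be written out for a mixture of $k$ one-dimensional Gaussians. The derivation assumes a known shared variance $\sigma^2$ and equally likely components, which the slides do not state explicitly.

```latex
\begin{aligned}
p(y_i \mid h') &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\Big(-\frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij}\,(x_i - \mu_j')^2\Big) \\
\ln P(Y \mid h') &= \sum_{i=1}^{m} \Big[ \ln\frac{1}{\sqrt{2\pi\sigma^2}}
  - \frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij}\,(x_i - \mu_j')^2 \Big] \\
Q(h' \mid h) &= \sum_{i=1}^{m} \Big[ \ln\frac{1}{\sqrt{2\pi\sigma^2}}
  - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}]\,(x_i - \mu_j')^2 \Big] \\
\frac{\partial Q}{\partial \mu_j'} = 0 \;&\Rightarrow\;
  \mu_j' = \frac{\sum_{i=1}^{m} E[z_{ij}]\,x_i}{\sum_{i=1}^{m} E[z_{ij}]}
\end{aligned}
```

Because $\ln P(Y \mid h')$ is linear in the hidden variables $z_{ij}$, taking the expectation only replaces each $z_{ij}$ by $E[z_{ij}]$, and maximizing over $\mu_j'$ gives the weighted-mean update of the M step.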