CS 570 Artificial Intelligence Chapter 20. Bayesian Learning

Presentation Transcript


  1. CS 570 Artificial Intelligence, Chapter 20. Bayesian Learning Jahwan Kim, Dept. of CS, KAIST

  2. Contents • Bayesian Learning • Bayesian inference • MAP and ML • Naïve Bayes method • Bayesian network • Parameter Learning • Examples • Regression and LMS • EM Algorithm • Algorithm • Mixture of Gaussians

  3. Bayesian Learning • Let h1,…,hn be the possible hypotheses. • Let d=(d1,…,dn) be the observed data vectors. • The iid assumption is often (in practice, almost always) made. • Let X denote the prediction. • In Bayesian learning, • compute the probability of each hypothesis given the data, and predict on that basis. • Predictions are made using all hypotheses. • Learning in the Bayesian setting is reduced to probabilistic inference.

  4. Bayesian Learning • The probability that the prediction is X, when the data d are observed, is P(X|d) = ∑i P(X|d, hi)P(hi|d) = ∑i P(X|hi)P(hi|d). • The prediction is a weighted average over the predictions of the individual hypotheses. • Hypotheses are intermediaries between the data and the predictions. • This requires computing P(hi|d) for all i, which is usually intractable.
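
As a concrete illustration of the prediction formula above, here is a minimal Python sketch (the function names and setup are mine, not from the slides) that computes the posteriors P(hi|d) under the iid assumption and then forms the weighted-average prediction:

    from math import prod

    def posteriors(priors, likelihood, data):
        # P(h_i | d) is proportional to P(h_i) * prod_j P(d_j | h_i) (iid assumption), then normalized.
        unnorm = [p * prod(likelihood(h, d) for d in data)
                  for h, p in enumerate(priors)]
        z = sum(unnorm)
        return [u / z for u in unnorm]

    def predict(priors, likelihood, data, x):
        # P(X = x | d) = sum_i P(X = x | h_i) P(h_i | d)
        post = posteriors(priors, likelihood, data)
        return sum(likelihood(h, x) * post[h] for h in range(len(priors)))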

  5. Bayesian Learning Basics: Terms • P(hi|d) is called the posterior (or a posteriori) probability. • By Bayes' rule, P(hi|d) ∝ P(d|hi)P(hi). • P(hi) is called the (hypothesis) prior. • We can embed prior knowledge by means of the prior. • It also controls the complexity of the model. • P(d|hi) is called the likelihood of the data. • Under the iid assumption, P(d|hi) = ∏j P(dj|hi). • Let hMAP be the hypothesis for which the posterior probability P(hi|d) is maximal. It is called the maximum a posteriori (or MAP) hypothesis.

  6. Bayesian Learning Basics: MAP Approximation • Since calculating the exact probability is often impractical, we approximate using the MAP hypothesis, that is, P(X|d) ≈ P(X|hMAP). • MAP is often easier than the full Bayesian method, because instead of a large summation (or integration), an optimization problem is solved.

  7. Bayesian Learning Basics: MDL Principle • Since P(hi|d) ∝ P(d|hi)P(hi), instead of maximizing P(hi|d) we may maximize P(d|hi)P(hi). • Equivalently, we may minimize −log P(d|hi)P(hi) = −log P(d|hi) − log P(hi). • We can interpret this as choosing hi to minimize the number of bits required to encode the hypothesis hi and the data d under that hypothesis. • The principle of minimizing code length (under some pre-determined coding scheme) is called the minimum description length (or MDL) principle. • MDL is used in a wide range of practical machine learning applications.

  8. Bayesian Learning Basics: Maximum Likelihood • Assume furthermore that the P(hi) are all equal, i.e., assume a uniform prior. • This is a reasonable approach when there is no reason to prefer one hypothesis over another a priori. • In that case, to obtain the MAP hypothesis it suffices to maximize the likelihood P(d|hi). Such a hypothesis is called the maximum likelihood hypothesis hML. • In other words, MAP with a uniform prior is equivalent to ML.

  9. Bayesian Learning Basics: Candy Example • Two flavors of candy, cherry and lime. • Each piece of candy is wrapped in the same opaque wrapper. • Sold in very large bags, of which there are known to be five kinds: h1: 100% cherry, h2: 75% cherry + 25% lime, h3: 50-50, h4: 25% cherry + 75% lime, h5: 100% lime. • Priors known: P(h1),…,P(h5) are 0.1, 0.2, 0.4, 0.2, 0.1. • Suppose from a bag of candy we took N pieces and all of them were lime (data dN). What are the posterior probabilities P(hi|dN)?

  10. Bayesian Learning Basics: Candy Example • P(h1|dN) ∝ P(dN|h1)P(h1) = 0, P(h2|dN) ∝ P(dN|h2)P(h2) = 0.2(0.25)^N, P(h3|dN) ∝ P(dN|h3)P(h3) = 0.4(0.5)^N, P(h4|dN) ∝ P(dN|h4)P(h4) = 0.2(0.75)^N, P(h5|dN) ∝ P(dN|h5)P(h5) = P(h5) = 0.1. • Normalize these by requiring them to sum to 1.
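
Plugging the candy numbers into the sketch from slide 4 (again an illustration of mine, not part of the original slides) shows the posterior mass shifting toward h5 as N grows:

    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for the five bag types
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # hypothesis priors from slide 9

    def likelihood(h, d):                     # d is "lime" or "cherry"
        return lime_prob[h] if d == "lime" else 1.0 - lime_prob[h]

    data = ["lime"] * 10                      # N = 10 lime candies observed
    print(posteriors(priors, likelihood, data))        # nearly all mass on h5
    print(predict(priors, likelihood, data, "lime"))   # P(next candy is lime | d)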

  11. Bayesian Learning Basics: Parameter Learning • Introduce a parametric probability model with parameter θ. • The hypotheses hθ are then parametrized by θ. • In the simplest case, θ is a single scalar; in more complex cases, θ consists of many components. • Using the data d, estimate the parameter θ.

  12. Parameter Learning Example: Discrete Case • A bag of candy whose lime-cherry proportions are completely unknown. • In this case we have hypotheses hθ parametrized by the probability θ of cherry. • P(d|hθ) = ∏j P(dj|hθ) = θ^cherry (1−θ)^lime, where the exponents are the numbers of cherry and lime candies observed. • Now suppose two wrappers, green and red, are selected according to some unknown conditional distribution depending on the flavor. • This model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime), and P(d|hθ,θ1,θ2) = θ^cherry (1−θ)^lime · θ1^(red,cherry) (1−θ1)^(green,cherry) · θ2^(red,lime) (1−θ2)^(green,lime), where each exponent is the count of candies with that wrapper-flavor combination.

  13. Parameter Learning Example: Single-Variable Gaussian • Gaussian pdf on a single variable: P(x) = (1/(√(2π)σ)) exp(−(x−μ)²/(2σ²)). • Suppose x1,…,xN are observed. Then the log likelihood is L = ∑j log P(xj) = −N log(√(2π)σ) − ∑j (xj−μ)²/(2σ²). • We want to find the μ and σ that maximize this: find where the gradient is zero.

  14. Parameter Learning Example: Single-Variable Gaussian • Solving this, we find μML = (1/N) ∑j xj and σML² = (1/N) ∑j (xj − μML)². • This confirms that ML agrees with common sense: the estimates are the sample mean and the sample variance.
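
A few lines of NumPy (an illustration I am adding, with made-up toy data) confirm that these ML formulas are just the sample mean and the 1/N sample variance:

    import numpy as np

    x = np.random.normal(loc=2.0, scale=3.0, size=10_000)   # toy data, true mu=2, sigma=3
    mu_ml = x.mean()                                         # mu_ML = (1/N) sum_j x_j
    sigma_ml = np.sqrt(np.mean((x - mu_ml) ** 2))            # sigma_ML uses 1/N, not 1/(N-1)
    print(mu_ml, sigma_ml)                                   # approximately 2.0 and 3.0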

  15. Parameter Learning Example: Linear Regression • Consider a linear Gaussian model with one continuous parent X and a continuous child Y. • Y has a Gaussian distribution whose mean depends linearly on the value of X, and Y has fixed standard deviation σ. • The data are (xj, yj). • Let the mean of Y be θ1x + θ2. • Then P(y|x) ∝ (1/σ) exp(−(y − (θ1x + θ2))²/(2σ²)). • Maximizing the log likelihood is equivalent to minimizing E = ∑j (yj − (θ1xj + θ2))². • This quantity is the well-known sum of squared errors. Thus in the linear regression case, ML is equivalent to Least Mean-Square (LMS).
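
The equivalence can be checked numerically; this sketch (toy data and parameter values are assumptions of mine) minimizes the sum of squared errors by ordinary least squares, which is exactly the ML solution for θ1 and θ2:

    import numpy as np

    x = np.random.uniform(0.0, 10.0, size=200)
    y = 1.5 * x + 4.0 + np.random.normal(0.0, 2.0, size=200)   # true theta1=1.5, theta2=4.0, sigma=2

    # Minimize E = sum_j (y_j - (theta1 x_j + theta2))^2 via the normal equations.
    A = np.column_stack([x, np.ones_like(x)])
    (theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(theta1, theta2)                                       # approximately 1.5 and 4.0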

  16. Parameter Learning Example: Beta Distribution • Candy example revisited. • In the Bayesian view, θ is the value of a random variable Θ, and P(Θ) is a continuous distribution. • The uniform density is one candidate prior. • Another possibility is to use beta distributions. • A beta distribution has two hyperparameters a and b and is given by beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1), where α is a normalizing constant. • It has mean a/(a+b). • It is more peaked when a+b is large, suggesting greater certainty about the value of Θ.

  17. Parameter Learning Example: Beta Distribution • The beta distribution has the nice property that if Θ has prior beta[a,b], then the posterior distribution of Θ is also a beta distribution: P(θ|d=cherry) ∝ P(d=cherry|θ)P(θ) ∝ θ · beta[a,b](θ) ∝ θ · θ^(a−1)(1−θ)^(b−1) = θ^a (1−θ)^(b−1) ∝ beta[a+1,b](θ). • The beta distribution is called the conjugate prior for the family of distributions for a Boolean variable.
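
The conjugate update amounts to counting: each cherry observation increments a and each lime observation increments b. A small sketch of that bookkeeping (illustrative, with hypothetical names):

    def beta_update(a, b, observations):
        # Posterior hyperparameters after observing a sequence of flavors;
        # P(theta | cherry) is proportional to theta * beta[a,b](theta) = beta[a+1,b](theta),
        # and symmetrically a lime observation yields beta[a,b+1].
        for flavor in observations:
            if flavor == "cherry":
                a += 1
            else:
                b += 1
        return a, b

    a, b = beta_update(1, 1, ["cherry", "cherry", "lime"])   # start from the uniform prior beta[1,1]
    print(a, b, a / (a + b))                                 # beta[3,2], posterior mean 0.6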

  18. Naïve Bayes Method • Attributes (components of the observed data) are assumed to be independent in the Naïve Bayes method. • Works well for about 2/3 of real-world problems, despite the naivete of this assumption. • Goal: predict the class C, given the observed data Xi = xi. • By the independence assumption, P(C|x1,…,xn) ∝ P(C) ∏i P(xi|C). • We choose the most likely class. • Merits of NB: • Scales well: no search is required. • Robust against noisy data. • Gives probabilistic predictions.
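
A compact Naive Bayes sketch over discrete attributes (the code, including the simple add-one smoothing of the counts, is my illustration rather than anything prescribed by the slides):

    import math
    from collections import Counter, defaultdict

    def train_nb(examples):
        # examples: list of (attribute_tuple, class_label) pairs
        class_counts = Counter(c for _, c in examples)
        attr_counts = defaultdict(Counter)              # (class, position) -> value counts
        for xs, c in examples:
            for i, v in enumerate(xs):
                attr_counts[(c, i)][v] += 1
        return class_counts, attr_counts, len(examples)

    def predict_nb(model, xs):
        class_counts, attr_counts, n = model
        def log_score(c):
            # log P(C=c) + sum_i log P(x_i | C=c), with add-one smoothing over observed values
            s = math.log(class_counts[c] / n)
            for i, v in enumerate(xs):
                counts = attr_counts[(c, i)]
                s += math.log((counts[v] + 1) / (class_counts[c] + len(counts) + 1))
            return s
        return max(class_counts, key=log_score)         # most likely class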

  19. Bayesian Network • Combines all observations according to their dependency relations. • More formally, a Bayesian network consists of the following: • A set of variables (nodes). • A set of directed edges between variables. • The graph is assumed to be acyclic (i.e., there is no directed cycle). • To each variable A with parents B1,…,Bn, a conditional probability (potential) table P(A|B1,…,Bn) is attached.

  20. Bayesian Network • A compact representation of the joint probability table (distribution). • Without exploiting the dependency relations, the full joint probability table is intractable. • Examples of Bayesian networks.
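
The compactness comes from the factorization P(X1,…,Xn) = ∏i P(Xi | Parents(Xi)); the following toy network (two variables invented purely for illustration) shows how the joint probability is recovered from the attached tables:

    # Hypothetical two-node network Rain -> WetGrass, with its two potential tables.
    P_rain = {True: 0.2, False: 0.8}
    P_wet_given_rain = {True: {True: 0.9, False: 0.1},
                        False: {True: 0.1, False: 0.9}}

    def joint(rain, wet):
        # P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
        return P_rain[rain] * P_wet_given_rain[rain][wet]

    print(joint(True, True))                                               # 0.18
    print(sum(joint(r, w) for r in (True, False) for w in (True, False)))  # 1.0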

  21. Issues in Bayesian Network • Learning the structure: no systematic method exists. • Updating the network after an observation is also hard: NP-hard in general. • There are algorithms to overcome this computational complexity. • Hidden (latent) variables can simplify the structure substantially.

  22. EM Algorithm: Learning with Hidden Variables • Latent (hidden) variables are not directly observable. • Latent variables are everywhere: in HMMs, mixtures of Gaussians, Bayesian networks, … • The EM (Expectation-Maximization) algorithm solves the problem of learning parameters in the presence of latent variables • in a very general way, • and also in a very simple way. • EM is an iterative algorithm: it iterates over E- and M-steps repeatedly, updating the parameters at each step.

  23. EM Algorithm • An iterative algorithm. • Let θ be the parameters of the model, θ^(i) be its estimated value at the i-th step, and Z be the hidden variable. • Expectation (E-step): compute the expectation, with respect to the hidden variable, of the completed-data log-likelihood function ∑z P(Z=z|x, θ^(i)) log P(x, Z=z|θ). • Maximization (M-step): update θ by maximizing this expectation: θ^(i+1) = argmaxθ ∑z P(Z=z|x, θ^(i)) log P(x, Z=z|θ). • Iterate the E- and M-steps until convergence.
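
The iteration can be written as a short generic loop; in this sketch (the function names are mine) e_step returns P(Z=z|x, θ^(i)) in whatever form m_step expects, and m_step performs the argmax:

    def em(theta, data, e_step, m_step, distance, iters=100, tol=1e-6):
        """Generic EM loop: alternate E- and M-steps until the parameters stop moving."""
        for _ in range(iters):
            expectations = e_step(theta, data)      # E-step: P(Z=z | x, theta^(i))
            new_theta = m_step(expectations, data)  # M-step: argmax_theta of the expected complete-data log likelihood
            if distance(theta, new_theta) < tol:    # caller-supplied distance on parameters
                return new_theta
            theta = new_theta
        return theta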

  24. EM Algorithm • Resembles a gradient-descent algorithm, but has no step-size parameter. • EM increases the log likelihood at every step. • It may still have convergence problems. • Several variants of the EM algorithm have been suggested to overcome such difficulties. • Adding priors, trying different initializations, and choosing reasonable initial values all help.

  25. EM Algorithm Prototypical Example: Mixture of Gaussians • A mixture distribution P(X) = ∑i=1..k P(C=i) P(X|C=i). • P(X|C=i) is the distribution of the i-th component. • When each P(X|C=i) is a (multivariate) Gaussian, the distribution is called a mixture of Gaussians. • It has the following parameters: • weights wi = P(C=i), • means μi, • covariances Σi. • The problem in learning the parameters: we don't know which component generated each data point.

  26. EM Algorithm Prototypical Example: Mixture of Gaussians • Introduce indicator hidden variables Z = (Zj): from which component was xj generated? • The update equations can be derived analytically, but the derivation is complicated. (See for example http://www.lans.ece.utexas.edu/course/ee380l/2002sp/blimes98gentle.pdf) • Skipping the details, the updates are as follows: • Let pij = P(C=i|xj) ∝ P(xj|C=i)P(C=i), pi = ∑j pij, and wi = P(C=i). • Update μi ← ∑j pij xj / pi, Σi ← ∑j pij xj xj^T / pi, wi ← pi.
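
A self-contained EM-for-Gaussian-mixtures sketch (my own illustration; note that where the slide writes Σi ← ∑j pij xj xj^T / pi, the code uses the usual centered form ∑j pij (xj−μi)(xj−μi)^T / pi, and it normalizes the weights by N):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(x, k, iters=100):
        """EM for a mixture of k Gaussians; x has shape (N, D). Naive initialization."""
        n, d = x.shape
        w = np.full(k, 1.0 / k)                           # mixing weights w_i
        mu = x[np.random.choice(n, k, replace=False)]     # means mu_i (random data points)
        sigma = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(k)])  # covariances Sigma_i
        for _ in range(iters):
            # E-step: p_ij = P(C=i | x_j), proportional to w_i * N(x_j; mu_i, Sigma_i)
            p = np.array([w[i] * multivariate_normal.pdf(x, mu[i], sigma[i]) for i in range(k)])
            p /= p.sum(axis=0, keepdims=True)
            # M-step: p_i = sum_j p_ij, then re-estimate weights, means, covariances
            pi = p.sum(axis=1)
            w = pi / n
            mu = (p @ x) / pi[:, None]
            for i in range(k):
                diff = x - mu[i]
                sigma[i] = (diff.T * p[i]) @ diff / pi[i]
        return w, mu, sigma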

  27. EM Algorithm Prototypical Example: Mixture of Gaussians • For a nice “look-and-feel” demo of the EM algorithm on mixtures of Gaussians, see http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html

  28. EM Algorithm Example: Bayesian Network, HMM • Omitted. • Covered later in class / student presentation (?)
