1 / 43

Maximum Likelihood Estimator and Bayesian Inference in AI

Learn about parameter estimation, maximum likelihood estimation, and Bayesian inference in artificial intelligence. Understand the properties, limitations, and applications of MLE and Bayesian networks.

kmackey
Download Presentation

Maximum Likelihood Estimator and Bayesian Inference in AI

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS498-EAReasoning in AILecture #10 Instructor: Eyal Amir Fall Semester 2009 Some slides in this set were adopted from Eran Segal

  2. Known Structure, Complete Data Goal: Parameter estimation Data does not contain missing values X1 X2 X1 X2 Inducer Initial network Y Y InputData

  3. Maximum Likelihood Estimator General case Observations: MH heads and MT tails Find  maximizing likelihood Equivalent to maximizing log-likelihood Differentiating the log-likelihood and solving for  we get that the maximum likelihood parameter is:

  4. Sufficient Statistics for Multinomial A sufficient statistics for a dataset D over a variable Y with k values is the tuple of counts <M1,...Mk> such that Mi is the number of times that the Y=yi in D Sufficient statistic Define s(x[i]) as a tuple of dimension k s(x[i])=(0,...0,1,0,...,0) (1,...,i-1) (i+1,...,k)

  5. Properties of ML Estimator • Unbiased • M number of occurences of H • N total no.experiments •  = Pr(H) • Variance

  6. Sufficient Statistic for Gaussian Gaussian distribution: Rewrite as  sufficient statistics for Gaussian: s(x[i])=<1,x,x2>

  7. Maximum Likelihood Estimation MLE Principle: Choose  that maximize L(D:) Multinomial MLE: Gaussian MLE:

  8. MLE for Bayesian Networks Parameters x0, x1 y0|x0, y1|x0, y0|x1, y1|x1 Data instance tuple: <x[m],y[m]> Likelihood X Y  Likelihood decomposes into two separate terms, one for each variable

  9. MLE for Bayesian Networks Terms further decompose by CPDs: By sufficient statistics where M[x0,y0] is the number of data instances in which X takes the value x0 and Y takes the value y0 MLE

  10. MLE for Bayesian Networks Likelihood for Bayesian network  if Xi|Pa(Xi) are disjoint then MLE can be computed by maximizing each local likelihood separately

  11. MLE for Table CPD BayesNets Multinomial CPD For each value xX we get an independent multinomial problem where the MLE is

  12. MLE for Tree CPDs Assume tree CPD with known tree structure Terms for <y0,z0> and <y0,z1> can be combined Optimization can be done by leaves Y Z X Y y0 y1 Z x0|y0,x1|y0 z0 z1 x0|y1,z1,x1|y1,z1 x0|y1,z1,x1|y1,z1

  13. MLE for Tree CPD BayesNets Tree CPD T, leaves l For each value lLeaves(T) we get an independent multinomial problem where the MLE is

  14. Limitations of MLE Two teams play 10 times, and the first wins 7 of the 10 matches  Probability of first team winning = 0.7 A coin is tosses 10 times, and comes out ‘head’ 7 of the 10 tosses  Probability of head = 0.7 Would you place the same bet on the next game as you would on the next coin toss? We need to incorporate prior knowledge Prior knowledge should only be used as a guide

  15. Bayesian Inference Assumptions Given a fixed  tosses are independent If  is unknown tosses are not marginally independent – each toss tells us something about  The following network captures our assumptions  … X[2] X[M] X[1]

  16. Bayesian Inference Joint probabilistic model Posterior probability over   … X[2] X[M] X[1] Likelihood Prior For a uniform prior, posterior is the normalized likelihood Normalizing factor

  17. Bayesian Prediction Predict the data instance from the previous ones Solve for uniform prior P()=1 and binomial variable

  18. Example: Binomial Data Prior: uniform for  in [0,1] P() = 1  P(|D) is proportional to the likelihood L(D:) MLE for P(X=H) is 4/5 = 0.8 Bayesian prediction is 5/7 (MH,MT ) = (4,1) 0 0.2 0.4 0.6 0.8 1

  19. Dirichlet Priors A Dirichlet prior is specified by a set of (non-negative) hyperparameters 1,...k so that  ~ Dirichlet(1,...k) if where and Intuitively, hyperparameters correspond to the number of imaginary examples we saw before starting the experiment

  20. Dirichlet Priors – Example 5 Dirichlet(1,1) Dirichlet(2,2) 4.5 Dirichlet(0.5,0.5) Dirichlet(5,5) 4 3.5 3 2.5 2 1.5 1 0.5 0 0 0.2 0.4 0.6 0.8 1

  21. Dirichlet Priors Dirichlet priors have the property that the posterior is also Dirichlet Data counts are M1,...,Mk Prior is Dir(1,...k)  Posterior is Dir(1+M1,...k+Mk) The hyperparameters 1,…,K can be thought of as “imaginary” counts from our prior experience Equivalent sample size = 1+…+K The larger the equivalent sample size the more confident we are in our prior

  22. Effect of Priors Prediction of P(X=H) after seeing data with MH=1/4MT as a function of the sample size 0.55 0.6 Different strength H + T Fixed ratio H / T 0.5 Fixed strength H + T Different ratio H / T 0.5 0.45 0.4 0.4 0.35 0.3 0.3 0.2 0.25 0.1 0.2 0.15 0 0 20 40 60 80 100 0 20 40 60 80 100

  23. Effect of Priors (cont.) In real data, Bayesian estimates are less sensitive to noise in the data MLE Dirichlet(.5,.5) Dirichlet(1,1) Dirichlet(5,5) Dirichlet(10,10) 0.7 0.6 0.5 P(X = 1|D) 0.4 0.3 0.2 N 0.1 5 10 15 20 25 30 35 40 45 50 1 Toss Result 0 N

  24. General Formulation Joint distribution over D, Posterior distribution over parameters P(D) is the marginal likelihood of the data As we saw, likelihood can be described compactly using sufficient statistics We want conditions in which posterior is also compact

  25. Conjugate Families A family of priors P(:) is conjugate to a model P(|) if for any possible dataset D of i.i.d samples from P(|) and choice of hyperparameters  for the prior over , there are hyperparameters ’ that describe the posterior, i.e.,P(:’) P(D|)P(:) Posterior has the same parametric form as the prior Dirichlet prior is a conjugate family for the multinomial likelihood Conjugate families are useful since: Many distributions can be represented with hyperparameters They allow for sequential update within the same representation In many cases we have closed-form solutions for prediction

  26. Bayesian Estimation in BayesNets Instances are independent given the parameters (x[m’],y[m’]) are d-separated from (x[m],y[m]) given  Priors for individual variables are a priori independent Global independence of parameters Bayesian network for parameter estimation Bayesian network X X … X[2] X[M] X[1] … Y Y[2] Y[M] Y[1] Y|X

  27. Bayesian Estimation in BayesNets Posteriors of  are independent given complete data Complete data d-separates parameters for different CPDs As in MLE, we can solve each estimation problem separately Bayesian network for parameter estimation Bayesian network X X … X[2] X[M] X[1] … Y Y[2] Y[M] Y[1] Y|X

  28. Bayesian Estimation in BayesNets Posteriors of  are independent given complete data Also holds for parameters within families Note context specific independence between Y|X=0 and Y|X=1 when given both X and Y Bayesian network for parameter estimation Bayesian network X X … X[2] X[M] X[1] … Y Y[2] Y[M] Y[1] Y|X=0 Y|X=1

  29. Bayesian Estimation in BayesNets Posteriors of  can be computed independently For multinomial Xi|paiposterior is Dirichlet with parameters Xi=1|pai+M[Xi=1|pai],..., Xi=k|pai+M[Xi=k|pai] Bayesian network for parameter estimation Bayesian network X X … X[2] X[M] X[1] … Y Y[2] Y[M] Y[1] Y|X=0 Y|X=1

  30. Assessing Priors for BayesNets We need the (xi,pai) for each node xi We can use initial parameters 0 as prior information Need also an equivalent sample size parameter M’ Then, we let (xi,pai) = M’  P(xi,pai|0) This allows to update a network using new data • Example network for priors • P(X=0)=P(X=1)=0.5 • P(Y=0)=P(Y=1)=0.5 • M’=1 • Note: (x0)=0.5 (x0,y0)=0.25 X Y

  31. Case Study: ICU Alarm Network The “Alarm” network 37 variables Experiment Sample instances Learn parameters MLE Bayesian MINVOLSET KINKEDTUBE PULMEMBOLUS INTUBATION VENTMACH DISCONNECT PAP SHUNT VENTLUNG VENITUBE PRESS MINOVL FIO2 VENTALV PVSAT ANAPHYLAXIS ARTCO2 EXPCO2 SAO2 TPR INSUFFANESTH HYPOVOLEMIA LVFAILURE CATECHOL LVEDVOLUME STROEVOLUME ERRCAUTER HR ERRBLOWOUTPUT HISTORY CO CVP PCWP HREKG HRSAT HRBP BP

  32. Case Study: ICU Alarm Network MLE performs worst Prior M’=5 provides best smoothing MLE Bayes w/ Uniform Prior, M'=5 Bayes w/ Uniform Prior, M'=10 Bayes w/ Uniform Prior, M'=20 Bayes w/ Uniform Prior, M'=50 1.4 1.2 1 0.8 KL Divergence 0.6 0.4 0.2 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 M

  33. Parameter Estimation Summary Estimation relies on sufficient statistics For multinomials these are of the form M[xi,pai] Parameter estimation Bayesian methods also require choice of priors MLE and Bayesian are asymptotically equivalent Both can be implemented in an online manner by accumulating sufficient statistics Bayesian (Dirichlet) MLE

  34. THE END

  35. Approaches to structure learning C B B • Constraint-based: • dependency from statistical tests (eg. 2) • deduce structure from dependencies E (Pearl, 2000; Spirtes et al., 1993)

  36. Approaches to structure learning C B B • Constraint-based: • dependency from statistical tests (eg. 2) • deduce structure from dependencies E (Pearl, 2000; Spirtes et al., 1993)

  37. Approaches to structure learning C B B • Constraint-based: • dependency from statistical tests (eg. 2) • deduce structure from dependencies E (Pearl, 2000; Spirtes et al., 1993)

  38. Approaches to structure learning C B B • Constraint-based: • dependency from statistical tests (eg. 2) • deduce structure from dependencies E (Pearl, 2000; Spirtes et al., 1993) Attempts to reduce inductive problem to deductive problem

  39. Approaches to structure learning C B B • Constraint-based: • dependency from statistical tests (eg. 2) • deduce structure from dependencies E (Pearl, 2000; Spirtes et al., 1993) • Bayesian: • compute posterior probability of structures, given observed data C C B B E E P(h1|data) P(h0|data) P(h|data)  P(data|h) P(h) (Heckerman, 1998; Friedman, 1999)

  40. Bayesian Occam’s Razor h0 (no relationship) P(d | h ) h1 (relationship) All possible data sets d For any model h,

  41. Unknown Structure, Complete Data Goal: Structure learning & parameter estimation Data does not contain missing values X1 X2 X1 X2 Inducer Initial network Y Y InputData

  42. Known Structure, Incomplete Data Goal: Parameter estimation Data contains missing values X1 X2 X1 X2 Inducer Initial network Y Y InputData

  43. Unknown Structure, Incomplete Data Goal: Structure learning & parameter estimation Data contains missing values X1 X2 X1 X2 Inducer Initial network Y Y InputData

More Related