
CS B553: Algorithms for Optimization and Learning



  1. CS B553: Algorithms for Optimization and Learning. Continuous Probability Distributions and Bayesian Networks with Continuous Variables

  2. Agenda • Continuous probability distributions • Common families: • The Gaussian distribution • Linear Gaussian Bayesian networks

  3. Continuous probability distributions • Let X be a random variable in R, and let P(X) be a probability distribution over X • P(x) ≥ 0 for all x, and P "sums to 1" • Challenge: (most of the time) P(X=x) = 0 for any x

  4. CDF and PDF • Probability density function (pdf) f(x) • Nonnegative, with ∫ f(x) dx = 1 • Cumulative distribution function (cdf) g(x) • g(x) = P(X ≤ x) • g(-∞) = 0, g(∞) = 1, g(x) = ∫(-∞,x] f(y) dy, monotonic • f(x) = g′(x) • Both cdfs and pdfs are complete representations of the probability space over X, but usually pdfs are more intuitive to work with. [Figure: a pdf f(x) and the corresponding cdf g(x), which rises monotonically to 1]

  5. Caveats • pdfs may exceed 1 • Deterministic values, or ones taking on a few discrete values, can be represented in terms of the Dirac delta function δa(x), an improper pdf with • δa(x) = 0 if x ≠ a • δa(x) = ∞ if x = a • ∫ δa(x) dx = 1

  6. Common Distributions • Uniform distribution U(a,b) • p(x) = 1/(b-a) if x ∈ [a,b], 0 otherwise • P(X ≤ x) = 0 if x < a, (x-a)/(b-a) if x ∈ [a,b], 1 if x > b • Gaussian (normal) distribution N(μ,σ) • μ = mean, σ = standard deviation • P(X ≤ x) has no closed form: (1 + erf(x/√2))/2 for N(0,1) [Figure: pdfs of two uniform distributions U(a1,b1) and U(a2,b2)]
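
To make these formulas concrete, here is a minimal Python sketch (function names are my own; only the standard math module is used) of the U(a,b) and N(0,1) pdf/cdf expressions above:

```python
import math

def uniform_pdf(x, a, b):
    # p(x) = 1/(b-a) inside [a,b], 0 outside
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # P(X <= x): 0 below a, linear ramp on [a,b], 1 above b
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # No closed form; expressed via the error function erf
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(uniform_cdf(0.25, 0.0, 1.0))   # 0.25
print(normal_cdf(0.0))               # 0.5 for N(0,1)
```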

  7. Multivariate Continuous Distributions • Consider the c.d.f. g(x,y) = P(X ≤ x, Y ≤ y) • g(-∞,y) = 0, g(x,-∞) = 0 • g(∞,∞) = 1 • g(x,∞) = P(X ≤ x), g(∞,y) = P(Y ≤ y) • g monotonic • The joint density is given by the p.d.f. f(x,y) iff • g(p,q) = ∫(-∞,p] ∫(-∞,q] f(x,y) dy dx • i.e. P(ax ≤ X ≤ bx, ay ≤ Y ≤ by) = ∫[ax,bx] ∫[ay,by] f(x,y) dy dx

  8. Marginalization works over PDFs • Marginalizing f(x,y) over y: • If h(x) = ∫(-∞,∞) f(x,y) dy, then h(x) is the p.d.f. of X • Proof: • P(X ≤ a) = P(X ≤ a, Y ≤ ∞) = g(a,∞) = ∫(-∞,a] ∫(-∞,∞) f(x,y) dy dx • h(a) = d/da P(X ≤ a) = d/da ∫(-∞,a] ∫(-∞,∞) f(x,y) dy dx (definition) = ∫(-∞,∞) f(a,y) dy (fundamental theorem of calculus) • So the joint density contains all the information needed to reconstruct the density of each individual variable
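
A quick numerical sanity check of this fact (a NumPy sketch with an arbitrarily chosen density): integrate a correlated bivariate Gaussian density over y and compare it against its known N(0,1) marginal at a point.

```python
import numpy as np

def f_xy(x, y, rho=0.6):
    # standard bivariate Gaussian density with correlation rho
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-0.5 * z) / (2 * np.pi * np.sqrt(1 - rho**2))

x0 = 0.7
ys = np.linspace(-8, 8, 4001)
h = np.trapz(f_xy(x0, ys), ys)                      # integrate out y numerically
print(h)
print(np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi))    # analytic N(0,1) marginal at x0
```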

  9. Conditional densities • We might want to represent the density P(Y|X=x)… but how? • Naively, P(a ≤ Y ≤ b | X=x) = P(a ≤ Y ≤ b, X=x)/P(X=x), but the denominator is 0! • Instead, work with the pdf p(x,y) and take the limit of P(a ≤ Y ≤ b | x ≤ X ≤ x+ε) as ε → 0 • The limit is ∫[a,b] p(x,y) dy / p(x), so p(y|x) = p(x,y)/p(x) is the conditional density

  10. Transformations of continuous random variables • Suppose we want to compute the distribution of f(X), where X is a random variable with density pX(x) • Assume f is monotonically increasing (hence invertible) • Consider the random variable Y = f(X) • P(Y ≤ y) = ∫ I[f(x) ≤ y] pX(x) dx = P(X ≤ f⁻¹(y)) • pY(y) = d/dy P(Y ≤ y) = d/dy P(X ≤ f⁻¹(y)) = pX(f⁻¹(y)) · d/dy f⁻¹(y) (chain rule) = pX(f⁻¹(y)) / f′(f⁻¹(y)) (derivative of the inverse function) • For a decreasing f, the same argument gives pY(y) = pX(f⁻¹(y)) / |f′(f⁻¹(y))|
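
A small sketch of this change-of-variables rule (Python/NumPy; the choice f(x) = exp(x) is mine): compute pY(y) = pX(f⁻¹(y)) / f′(f⁻¹(y)) and check it against a Monte Carlo estimate.

```python
import numpy as np

# X ~ N(0,1), Y = f(X) = exp(X), a monotonically increasing transformation
f = np.exp
f_inv = np.log
f_prime = np.exp                      # f'(x) = exp(x)

def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    # Change of variables: p_Y(y) = p_X(f^-1(y)) / f'(f^-1(y))
    x = f_inv(y)
    return p_x(x) / f_prime(x)

# Monte Carlo check: fraction of samples in [1, 2] vs. the integral of p_y
rng = np.random.default_rng(0)
samples = f(rng.standard_normal(200_000))
ys = np.linspace(1.0, 2.0, 1001)
print(np.mean((samples >= 1.0) & (samples <= 2.0)))   # empirical probability
print(np.trapz(p_y(ys), ys))                          # ~ same value
```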

  11. Notes: • In general, continuous multivariate distributions are hard to handle exactly • But, there are specific classes that lead to efficient exact inference techniques • In particular, Gaussians • Other distributions usually require resorting to Monte Carlo approaches

  12. Multivariate Gaussians • X ~ N(m,S): the multivariate analog of the Gaussian in N-D space • Mean (vector) m, covariance (matrix) S • Density p(x) = (2π)^(-N/2) |S|^(-1/2) exp(-½ (x-m)ᵀ S⁻¹ (x-m)), where (2π)^(-N/2) |S|^(-1/2) is the normalization factor
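
A minimal NumPy sketch (helper name my own) that evaluates this density by computing the quadratic form and the normalization factor directly:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # p(x) = (2*pi)^(-N/2) |Sigma|^(-1/2) exp(-1/2 (x-mu)^T Sigma^-1 (x-mu))
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    N = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (-N / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf([0.5, 0.5], mu, Sigma))
```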

  13. Independence in Gaussians • If X ~ N(mX,SX) and Y ~ N(mY,SY) are independent, then (X,Y) ~ N([mX; mY], [SX 0; 0 SY]), i.e. the joint is Gaussian with block-diagonal covariance • Moreover, if X ~ N(m,S), then Sij = 0 iff Xi and Xj are independent

  14. Linear Transformations • Linear transformations of Gaussians • If X ~ N(m,S) and y = Ax + b, then Y ~ N(Am+b, ASAᵀ) • Consequence: if X ~ N(mX,SX) and Y ~ N(mY,SY) are independent and Z = X+Y, then Z ~ N(mX+mY, SX+SY)
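
A quick empirical check of the linear transformation rule (NumPy sketch, parameter values chosen arbitrarily): sample X ~ N(m,S), apply y = Ax + b, and compare the sample mean and covariance of Y with Am+b and ASAᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=100_000)
Y = X @ A.T + b                           # y = A x + b, applied row-wise

print(Y.mean(axis=0), A @ mu + b)         # sample mean vs. A m + b
print(np.cov(Y.T), A @ Sigma @ A.T)       # sample covariance vs. A S A^T
```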

  15. Marginalization and Conditioning • If (X,Y) ~ N([mX; mY], [SXX, SXY; SYX, SYY]), then: • Marginalization: summing out Y gives X ~ N(mX, SXX) • Conditioning: on observing Y=y, we have X | Y=y ~ N(mX + SXY SYY⁻¹ (y-mY), SXX - SXY SYY⁻¹ SYX)
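
The conditioning formula translates directly into code; here is a small NumPy sketch (function name my own), with SYX taken as SXYᵀ:

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Return mean and covariance of X | Y=y for (X,Y) jointly Gaussian."""
    # mu_{X|y} = mu_X + S_XY S_YY^-1 (y - mu_Y)
    # S_{X|y}  = S_XX - S_XY S_YY^-1 S_YX
    K = S_xy @ np.linalg.inv(S_yy)
    mu_cond = mu_x + K @ (y - mu_y)
    S_cond = S_xx - K @ S_xy.T
    return mu_cond, S_cond

mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx = np.array([[2.0]]); S_xy = np.array([[0.8]]); S_yy = np.array([[1.0]])
print(condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y=np.array([2.0])))
```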

  16. Linear Gaussian Models • A conditional linear Gaussian model has: • P(Y|X=x) = N(β0 + Ax, S0) • With parameters β0, A, and S0

  17. Linear Gaussian Models • A conditional linear Gaussian model has: • P(Y|X=x) = N(β0 + Ax, S0) • With parameters β0, A, and S0 • If X ~ N(mX,SX), then the joint distribution over (X,Y) is given by (X,Y) ~ N([mX; β0 + AmX], [SX, SXAᵀ; ASX, S0 + ASXAᵀ]) • (Recall the linear transformation rule: if X ~ N(m,S) and y = Ax+b, then Y ~ N(Am+b, ASAᵀ))
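
A small NumPy sketch (names my own) that assembles this joint: given X ~ N(mX,SX) and P(Y|x) = N(β0 + Ax, S0), return the stacked mean and the block covariance.

```python
import numpy as np

def clg_joint(mu_x, Sigma_x, beta0, A, Sigma0):
    """Joint N(mu, Sigma) over (X, Y) for X ~ N(mu_x, Sigma_x), Y|x ~ N(beta0 + A x, Sigma0)."""
    mu = np.concatenate([mu_x, beta0 + A @ mu_x])
    top = np.hstack([Sigma_x, Sigma_x @ A.T])
    bot = np.hstack([A @ Sigma_x, Sigma0 + A @ Sigma_x @ A.T])
    return mu, np.vstack([top, bot])

mu_x = np.array([0.0, 1.0])
Sigma_x = np.eye(2)
beta0 = np.array([0.5])
A = np.array([[1.0, -2.0]])
Sigma0 = np.array([[0.25]])
print(clg_joint(mu_x, Sigma_x, beta0, A, Sigma0))
```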

  18. CLG Bayesian Networks • If all variables in a Bayesian network have Gaussian or CLG CPDs, inference can be done efficiently! • Example network with edges X1 → Y, X2 → Y, X1 → Z, Y → Z: • P(X1) = N(μ1,σ1), P(X2) = N(μ2,σ2) • P(Y|x1,x2) = N(a·x1 + b·x2, σY) • P(Z|x1,y) = N(c + d·x1 + e·y, σZ)

  19. Canonical Representation • All factors in a CLG Bayes net can be represented in canonical form C(x;K,h,g) with C(x;K,h,g) = exp(-½ xᵀKx + hᵀx + g) • Ex: if P(Y|x) = N(β0 + Ax, S0), then P(y|x) = 1/Z exp(-½ (y - Ax - β0)ᵀ S0⁻¹ (y - Ax - β0)) = 1/Z exp(-½ (y,x)ᵀ [I -A]ᵀ S0⁻¹ [I -A] (y,x) + β0ᵀ S0⁻¹ [I -A] (y,x) - ½ β0ᵀ S0⁻¹ β0) • This is of the form C((y,x); K, h, g) with • K = [I -A]ᵀ S0⁻¹ [I -A] • h = [I -A]ᵀ S0⁻¹ β0 • g = log(1/Z) - ½ β0ᵀ S0⁻¹ β0
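
A NumPy sketch of this conversion (the (K, h, g) tuple representation and the Gaussian normalizer Z = (2π)^(dy/2)|S0|^(1/2) are spelled out explicitly; names are my own):

```python
import numpy as np

def linear_gaussian_to_canonical(beta0, A, Sigma0):
    """Canonical form C((y,x); K, h, g) of P(y|x) = N(beta0 + A x, Sigma0)."""
    dy, dx = A.shape
    P = np.linalg.inv(Sigma0)                 # precision of the conditional noise
    M = np.hstack([np.eye(dy), -A])           # (y - A x - beta0) = M (y,x) - beta0
    K = M.T @ P @ M
    h = M.T @ P @ beta0
    log_Z = 0.5 * dy * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Sigma0))
    g = -0.5 * beta0 @ P @ beta0 - log_Z      # g = log(1/Z) - 1/2 beta0^T S0^-1 beta0
    return K, h, g

K, h, g = linear_gaussian_to_canonical(
    beta0=np.array([0.5]), A=np.array([[1.0, -2.0]]), Sigma0=np.array([[0.25]]))
print(K, h, g)
```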

  20. Product Operations • C(x;K1,h1,g1) · C(x;K2,h2,g2) = C(x;K,h,g) with • K = K1+K2 • h = h1+h2 • g = g1+g2 • If the scopes of the two factors are not equivalent, first extend the K's with 0 rows and columns, and the h's with 0 entries, so that the rows/columns of the two factors line up
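
A sketch of the product operation for two canonical factors whose scopes have already been aligned (the zero-padding step is noted in a comment):

```python
import numpy as np

def canonical_product(K1, h1, g1, K2, h2, g2):
    # C(x; K1,h1,g1) * C(x; K2,h2,g2) = C(x; K1+K2, h1+h2, g1+g2)
    # Factors over different scopes must first be padded with zero
    # rows/columns (K) and zero entries (h) so their x-vectors line up.
    return K1 + K2, h1 + h2, g1 + g2

K1, h1, g1 = np.array([[2.0]]), np.array([1.0]), -0.5
K2, h2, g2 = np.array([[1.0]]), np.array([-0.5]), -0.2
print(canonical_product(K1, h1, g1, K2, h2, g2))
```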

  21. Sum Operation • ∫ C((x,y);K,h,g) dy = C(x;K′,h′,g′) with • K′ = KXX - KXY KYY⁻¹ KYX • h′ = hX - KXY KYY⁻¹ hY • g′ = g + ½ (log|2π KYY⁻¹| + hYᵀ KYY⁻¹ hY) • Using these two operations we can implement inference algorithms developed for discrete Bayes nets: • Top-down inference, variable elimination (exact) • Belief propagation (approximate)
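
A NumPy sketch of the sum (marginalization) operation, using the convention (my own) that the first nx coordinates of x are kept and the remaining y coordinates are integrated out:

```python
import numpy as np

def canonical_marginalize(K, h, g, nx):
    """Integrate y out of C((x,y); K, h, g); x = first nx coordinates."""
    Kxx, Kxy = K[:nx, :nx], K[:nx, nx:]
    Kyx, Kyy = K[nx:, :nx], K[nx:, nx:]
    hx, hy = h[:nx], h[nx:]
    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv))
                       + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new

K = np.array([[3.0, 1.0],
              [1.0, 2.0]])
h = np.array([0.5, -1.0])
print(canonical_marginalize(K, h, g=0.0, nx=1))
```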

  22. Monte Carlo with Gaussians • Assume sampling X ~ N(0,1) is given as a primitive RandN() • To sample X ~ N(m,s²), simply set x ← m + s·RandN() • How to generate a random multivariate Gaussian variable N(m,S)?

  23. Monte Carlo with Gaussians • Assume sampling X ~ N(0,1) is given as a primitive RandN() • To sample X ~ N(m,s²), simply set x ← m + s·RandN() • How to generate a random multivariate Gaussian variable N(m,S)? • Take the Cholesky decomposition S⁻¹ = LLᵀ (L is invertible if S is positive definite) • Let y = Lᵀ(x-m) • P(y) ∝ exp(-½ (y1² + … + yN²)) is isotropic, and each yi is independent • Sample each component of y at random • Set x ← L⁻ᵀ y + m
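
A NumPy sketch following this recipe (Cholesky factor of S⁻¹; equivalently one can factor S = LLᵀ directly and set x ← m + Ly):

```python
import numpy as np

def sample_mvn(mu, Sigma, rng):
    """Draw one sample from N(mu, Sigma) via the whitening trick above."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma))   # Sigma^-1 = L L^T
    y = rng.standard_normal(mu.size)               # isotropic y, each y_i ~ N(0,1)
    return np.linalg.solve(L.T, y) + mu            # x = L^-T y + mu

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = np.array([sample_mvn(mu, Sigma, rng) for _ in range(50_000)])
print(X.mean(axis=0))     # ~ mu
print(np.cov(X.T))        # ~ Sigma
```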

  24. Monte Carlo With Likelihood Weighting • Monte Carlo with rejection has probability 0 of hitting a continuous value given as evidence, so likelihood weighting must be used • Example network X → Y with P(X) = N(mX,SX), P(Y|x) = N(Ax + mY, SY), and evidence Y = y • Step 1: Sample x ~ N(mX,SX) • Step 2: Weight the sample by P(y|x)
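
A minimal sketch of likelihood weighting for this two-node network (NumPy, scalar case, parameter values chosen arbitrarily): sample x from its prior and weight each sample by the Gaussian likelihood of the observed y.

```python
import numpy as np

def normal_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Network: X ~ N(mu_X, s_X^2), Y | x ~ N(a*x + mu_Y, s_Y^2); evidence Y = y_obs
mu_X, s_X = 0.0, 1.0
a, mu_Y, s_Y = 2.0, 1.0, 0.5
y_obs = 3.0

rng = np.random.default_rng(0)
x = rng.normal(mu_X, s_X, size=100_000)      # Step 1: sample x from the prior
w = normal_pdf(y_obs, a * x + mu_Y, s_Y)     # Step 2: weight by P(y_obs | x)

print(np.sum(w * x) / np.sum(w))             # weighted estimate of E[X | Y=y_obs]
```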

  25. Hybrid Networks • Hybrid networks combine both discrete and continuous variables • Exact inference techniques are hard to apply • Exact posteriors are mixtures of Gaussians • Inference is NP-hard even in polytree networks • Monte Carlo techniques apply in a straightforward way • Belief approximation can be applied (e.g., collapsing Gaussian mixtures to single Gaussians)

  26. Issues • Non-Gaussian distributions • Nonlinear dependencies • More in future lectures on particle filtering
