
CS B553: Algorithms for Optimization and Learning



  1. CS B553: Algorithms for Optimization and Learning. Continuous Probability Distributions and Bayesian Networks with Continuous Variables

  2. Agenda • Continuous probability distributions • Common families: • The Gaussian distribution • Linear Gaussian Bayesian networks

  3. Continuous probability distributions • Let X be a random variable in R, and let P(X) be a probability distribution over X • P(x) ≥ 0 for all x, and P "sums to 1" • Challenge: (most of the time) P(X=x) = 0 for any x

  4. CDF and PDF • Probability density function (pdf) f(x) • Nonnegative, with ∫ f(x) dx = 1 • Cumulative distribution function (cdf) g(x) • g(x) = P(X ≤ x) • g(-∞) = 0, g(∞) = 1, g(x) = ∫(-∞,x] f(y) dy, monotonic • f(x) = g′(x) • Both cdfs and pdfs are complete representations of the probability space over X, but usually pdfs are more intuitive to work with. [Figure: a pdf f(x) and the corresponding cdf g(x), which rises monotonically to 1]

  5. Caveats • pdfs may exceed 1 • Deterministic values, or ones taking on a few discrete values, can be represented in terms of the Dirac delta function δa(x), an improper pdf with • δa(x) = 0 if x ≠ a • δa(x) = ∞ if x = a • ∫ δa(x) dx = 1

  6. Common Distributions • Uniform distribution U(a,b) • p(x) = 1/(b-a) if x ∈ [a,b], 0 otherwise • P(X ≤ x) = 0 if x < a, (x-a)/(b-a) if x ∈ [a,b], 1 if x > b • Gaussian (normal) distribution N(μ,σ) • μ = mean, σ = standard deviation • P(X ≤ x) has no closed form: (1 + erf(x/√2))/2 for N(0,1) [Figure: pdfs of two uniform distributions U(a1,b1) and U(a2,b2)]
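
To make these formulas concrete, here is a minimal Python sketch (function names are my own; only the standard math module is used) of the U(a,b) and N(0,1) pdf/cdf expressions above:

```python
import math

def uniform_pdf(x, a, b):
    # p(x) = 1/(b-a) inside [a,b], 0 outside
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # P(X <= x): 0 below a, linear ramp on [a,b], 1 above b
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # No closed form; expressed via the error function erf
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(uniform_cdf(0.25, 0.0, 1.0))   # 0.25
print(normal_cdf(0.0))               # 0.5 for N(0,1)
```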

  7. Multivariate Continuous Distributions • Consider the c.d.f. g(x,y) = P(X ≤ x, Y ≤ y) • g(-∞,y) = 0, g(x,-∞) = 0 • g(∞,∞) = 1 • g(x,∞) = P(X ≤ x), g(∞,y) = P(Y ≤ y) • g monotonic • The joint density is given by the p.d.f. f(x,y) iff • g(p,q) = ∫(-∞,p] ∫(-∞,q] f(x,y) dy dx • i.e. P(ax ≤ X ≤ bx, ay ≤ Y ≤ by) = ∫[ax,bx] ∫[ay,by] f(x,y) dy dx

  8. Marginalization works over PDFs • Marginalizing f(x,y) over y: • If h(x) = ∫(-∞,∞) f(x,y) dy, then h(x) is the p.d.f. of X • Proof: • P(X ≤ a) = P(X ≤ a, Y ≤ ∞) = g(a,∞) = ∫(-∞,a] ∫(-∞,∞) f(x,y) dy dx • h(a) = d/da P(X ≤ a) = d/da ∫(-∞,a] ∫(-∞,∞) f(x,y) dy dx (definition) = ∫(-∞,∞) f(a,y) dy (fundamental theorem of calculus) • So the joint density contains all the information needed to reconstruct the density of each individual variable
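
A quick numerical sanity check of this fact (a NumPy sketch with an arbitrarily chosen density): integrate a correlated bivariate Gaussian density over y and compare it against its known N(0,1) marginal at a point.

```python
import numpy as np

def f_xy(x, y, rho=0.6):
    # standard bivariate Gaussian density with correlation rho
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-0.5 * z) / (2 * np.pi * np.sqrt(1 - rho**2))

x0 = 0.7
ys = np.linspace(-8, 8, 4001)
h = np.trapz(f_xy(x0, ys), ys)                      # integrate out y numerically
print(h)
print(np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi))    # analytic N(0,1) marginal at x0
```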

  9. Conditional densities • We might want to represent the density P(Y|X=x)… but how? • Naively, P(a ≤ Y ≤ b | X=x) = P(a ≤ Y ≤ b, X=x)/P(X=x), but the denominator is 0! • Instead, work with the pdf p(x,y) and take the limit of P(a ≤ Y ≤ b | x ≤ X ≤ x+ε) as ε → 0 • The limit is ∫[a,b] p(x,y) dy / p(x), so p(y|x) = p(x,y)/p(x) is the conditional density

  10. Transformations of continuous random variables • Suppose we want to compute the distribution of f(X), where X is a random variable with density pX(x) • Assume f is monotonically increasing (hence invertible) • Consider the random variable Y = f(X) • P(Y ≤ y) = ∫ I[f(x) ≤ y] pX(x) dx = P(X ≤ f⁻¹(y)) • pY(y) = d/dy P(Y ≤ y) = d/dy P(X ≤ f⁻¹(y)) = pX(f⁻¹(y)) · d/dy f⁻¹(y) (chain rule) = pX(f⁻¹(y)) / f′(f⁻¹(y)) (derivative of the inverse function) • For a decreasing f, the same argument gives pY(y) = pX(f⁻¹(y)) / |f′(f⁻¹(y))|
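
A small sketch of this change-of-variables rule (Python/NumPy; the choice f(x) = exp(x) is mine): compute pY(y) = pX(f⁻¹(y)) / f′(f⁻¹(y)) and check it against a Monte Carlo estimate.

```python
import numpy as np

# X ~ N(0,1), Y = f(X) = exp(X), a monotonically increasing transformation
f = np.exp
f_inv = np.log
f_prime = np.exp                      # f'(x) = exp(x)

def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    # Change of variables: p_Y(y) = p_X(f^-1(y)) / f'(f^-1(y))
    x = f_inv(y)
    return p_x(x) / f_prime(x)

# Monte Carlo check: fraction of samples in [1, 2] vs. the integral of p_y
rng = np.random.default_rng(0)
samples = f(rng.standard_normal(200_000))
ys = np.linspace(1.0, 2.0, 1001)
print(np.mean((samples >= 1.0) & (samples <= 2.0)))   # empirical probability
print(np.trapz(p_y(ys), ys))                          # ~ same value
```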

  11. Notes: • In general, continuous multivariate distributions are hard to handle exactly • But, there are specific classes that lead to efficient exact inference techniques • In particular, Gaussians • Other distributions usually require resorting to Monte Carlo approaches

  12. Multivariate Gaussians • X ~ N(m,S): the multivariate analog of the Gaussian in N-D space • Mean (vector) m, covariance (matrix) S • Density p(x) = (2π)^(-N/2) |S|^(-1/2) exp(-½ (x-m)ᵀ S⁻¹ (x-m)), where (2π)^(-N/2) |S|^(-1/2) is the normalization factor
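
A minimal NumPy sketch (helper name my own) that evaluates this density by computing the quadratic form and the normalization factor directly:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # p(x) = (2*pi)^(-N/2) |Sigma|^(-1/2) exp(-1/2 (x-mu)^T Sigma^-1 (x-mu))
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    N = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (-N / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf([0.5, 0.5], mu, Sigma))
```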

  13. Independence in Gaussians • If X ~ N(mX,SX) and Y ~ N(mY,SY) are independent, then (X,Y) ~ N([mX; mY], [SX 0; 0 SY]), i.e. the joint is Gaussian with block-diagonal covariance • Moreover, if X ~ N(m,S), then Sij = 0 iff Xi and Xj are independent

  14. Linear Transformations • Linear transformations of Gaussians • If X ~ N(m,S) and y = Ax + b, then Y ~ N(Am+b, ASAᵀ) • Consequence: if X ~ N(mX,SX) and Y ~ N(mY,SY) are independent and Z = X+Y, then Z ~ N(mX+mY, SX+SY)
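
A quick empirical check of the linear transformation rule (NumPy sketch, parameter values chosen arbitrarily): sample X ~ N(m,S), apply y = Ax + b, and compare the sample mean and covariance of Y with Am+b and ASAᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=100_000)
Y = X @ A.T + b                           # y = A x + b, applied row-wise

print(Y.mean(axis=0), A @ mu + b)         # sample mean vs. A m + b
print(np.cov(Y.T), A @ Sigma @ A.T)       # sample covariance vs. A S A^T
```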

  15. Marginalization and Conditioning • If (X,Y) ~ N([mX; mY], [SXX, SXY; SYX, SYY]), then: • Marginalization: summing out Y gives X ~ N(mX, SXX) • Conditioning: on observing Y=y, we have X | Y=y ~ N(mX + SXY SYY⁻¹ (y-mY), SXX - SXY SYY⁻¹ SYX)
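
The conditioning formula translates directly into code; here is a small NumPy sketch (function name my own), with SYX taken as SXYᵀ:

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Return mean and covariance of X | Y=y for (X,Y) jointly Gaussian."""
    # mu_{X|y} = mu_X + S_XY S_YY^-1 (y - mu_Y)
    # S_{X|y}  = S_XX - S_XY S_YY^-1 S_YX
    K = S_xy @ np.linalg.inv(S_yy)
    mu_cond = mu_x + K @ (y - mu_y)
    S_cond = S_xx - K @ S_xy.T
    return mu_cond, S_cond

mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx = np.array([[2.0]]); S_xy = np.array([[0.8]]); S_yy = np.array([[1.0]])
print(condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y=np.array([2.0])))
```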

  16. Linear Gaussian Models • A conditional linear Gaussian model has: • P(Y|X=x) = N(β0 + Ax, S0) • With parameters β0, A, and S0

  17. Linear Gaussian Models • A conditional linear Gaussian model has: • P(Y|X=x) = N(β0 + Ax, S0) • With parameters β0, A, and S0 • If X ~ N(mX,SX), then the joint distribution over (X,Y) is given by (X,Y) ~ N([mX; β0 + AmX], [SX, SXAᵀ; ASX, S0 + ASXAᵀ]) • (Recall the linear transformation rule: if X ~ N(m,S) and y = Ax+b, then Y ~ N(Am+b, ASAᵀ))
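
A small NumPy sketch (names my own) that assembles this joint: given X ~ N(mX,SX) and P(Y|x) = N(β0 + Ax, S0), return the stacked mean and the block covariance.

```python
import numpy as np

def clg_joint(mu_x, Sigma_x, beta0, A, Sigma0):
    """Joint N(mu, Sigma) over (X, Y) for X ~ N(mu_x, Sigma_x), Y|x ~ N(beta0 + A x, Sigma0)."""
    mu = np.concatenate([mu_x, beta0 + A @ mu_x])
    top = np.hstack([Sigma_x, Sigma_x @ A.T])
    bot = np.hstack([A @ Sigma_x, Sigma0 + A @ Sigma_x @ A.T])
    return mu, np.vstack([top, bot])

mu_x = np.array([0.0, 1.0])
Sigma_x = np.eye(2)
beta0 = np.array([0.5])
A = np.array([[1.0, -2.0]])
Sigma0 = np.array([[0.25]])
print(clg_joint(mu_x, Sigma_x, beta0, A, Sigma0))
```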

  18. CLG Bayesian Networks • If all variables in a Bayesian network have Gaussian or CLG CPDs, inference can be done efficiently! • Example network with edges X1 → Y, X2 → Y, X1 → Z, Y → Z: • P(X1) = N(μ1,σ1), P(X2) = N(μ2,σ2) • P(Y|x1,x2) = N(a·x1 + b·x2, σY) • P(Z|x1,y) = N(c + d·x1 + e·y, σZ)

  19. Canonical Representation • All factors in a CLG Bayes net can be represented in canonical form C(x;K,h,g) with C(x;K,h,g) = exp(-½ xᵀKx + hᵀx + g) • Ex: if P(Y|x) = N(β0 + Ax, S0), then P(y|x) = 1/Z exp(-½ (y - Ax - β0)ᵀ S0⁻¹ (y - Ax - β0)) = 1/Z exp(-½ (y,x)ᵀ [I -A]ᵀ S0⁻¹ [I -A] (y,x) + β0ᵀ S0⁻¹ [I -A] (y,x) - ½ β0ᵀ S0⁻¹ β0) • This is of the form C((y,x); K, h, g) with • K = [I -A]ᵀ S0⁻¹ [I -A] • h = [I -A]ᵀ S0⁻¹ β0 • g = log(1/Z) - ½ β0ᵀ S0⁻¹ β0
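
A NumPy sketch of this conversion (the (K, h, g) tuple representation and the Gaussian normalizer Z = (2π)^(dy/2)|S0|^(1/2) are spelled out explicitly; names are my own):

```python
import numpy as np

def linear_gaussian_to_canonical(beta0, A, Sigma0):
    """Canonical form C((y,x); K, h, g) of P(y|x) = N(beta0 + A x, Sigma0)."""
    dy, dx = A.shape
    P = np.linalg.inv(Sigma0)                 # precision of the conditional noise
    M = np.hstack([np.eye(dy), -A])           # (y - A x - beta0) = M (y,x) - beta0
    K = M.T @ P @ M
    h = M.T @ P @ beta0
    log_Z = 0.5 * dy * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Sigma0))
    g = -0.5 * beta0 @ P @ beta0 - log_Z      # g = log(1/Z) - 1/2 beta0^T S0^-1 beta0
    return K, h, g

K, h, g = linear_gaussian_to_canonical(
    beta0=np.array([0.5]), A=np.array([[1.0, -2.0]]), Sigma0=np.array([[0.25]]))
print(K, h, g)
```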

  20. Product Operations • C(x;K1,h1,g1) · C(x;K2,h2,g2) = C(x;K,h,g) with • K = K1+K2 • h = h1+h2 • g = g1+g2 • If the scopes of the two factors are not equivalent, first extend the K's with 0 rows and columns, and the h's with 0 entries, so that the rows/columns of the two factors line up
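
A sketch of the product operation for two canonical factors whose scopes have already been aligned (the zero-padding step is noted in a comment):

```python
import numpy as np

def canonical_product(K1, h1, g1, K2, h2, g2):
    # C(x; K1,h1,g1) * C(x; K2,h2,g2) = C(x; K1+K2, h1+h2, g1+g2)
    # Factors over different scopes must first be padded with zero
    # rows/columns (K) and zero entries (h) so their x-vectors line up.
    return K1 + K2, h1 + h2, g1 + g2

K1, h1, g1 = np.array([[2.0]]), np.array([1.0]), -0.5
K2, h2, g2 = np.array([[1.0]]), np.array([-0.5]), -0.2
print(canonical_product(K1, h1, g1, K2, h2, g2))
```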

  21. Sum Operation • ∫ C((x,y);K,h,g) dy = C(x;K′,h′,g′) with • K′ = KXX - KXY KYY⁻¹ KYX • h′ = hX - KXY KYY⁻¹ hY • g′ = g + ½ (log|2π KYY⁻¹| + hYᵀ KYY⁻¹ hY) • Using these two operations we can implement inference algorithms developed for discrete Bayes nets: • Top-down inference, variable elimination (exact) • Belief propagation (approximate)
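
A NumPy sketch of the sum (marginalization) operation, using the convention (my own) that the first nx coordinates of x are kept and the remaining y coordinates are integrated out:

```python
import numpy as np

def canonical_marginalize(K, h, g, nx):
    """Integrate y out of C((x,y); K, h, g); x = first nx coordinates."""
    Kxx, Kxy = K[:nx, :nx], K[:nx, nx:]
    Kyx, Kyy = K[nx:, :nx], K[nx:, nx:]
    hx, hy = h[:nx], h[nx:]
    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv))
                       + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new

K = np.array([[3.0, 1.0],
              [1.0, 2.0]])
h = np.array([0.5, -1.0])
print(canonical_marginalize(K, h, g=0.0, nx=1))
```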

  22. Monte Carlo with Gaussians • Assume sampling X ~ N(0,1) is given as a primitive RandN() • To sample X ~ N(m,s²), simply set x ← m + s·RandN() • How to generate a random multivariate Gaussian variable N(m,S)?

  23. Monte Carlo with Gaussians • Assume sampling X ~ N(0,1) is given as a primitive RandN() • To sample X ~ N(m,s²), simply set x ← m + s·RandN() • How to generate a random multivariate Gaussian variable N(m,S)? • Take the Cholesky decomposition S⁻¹ = LLᵀ (L is invertible if S is positive definite) • Let y = Lᵀ(x-m) • P(y) ∝ exp(-½ (y1² + … + yN²)) is isotropic, and each yi is independent • Sample each component of y at random • Set x ← L⁻ᵀ y + m
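
A NumPy sketch following this recipe (Cholesky factor of S⁻¹; equivalently one can factor S = LLᵀ directly and set x ← m + Ly):

```python
import numpy as np

def sample_mvn(mu, Sigma, rng):
    """Draw one sample from N(mu, Sigma) via the whitening trick above."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma))   # Sigma^-1 = L L^T
    y = rng.standard_normal(mu.size)               # isotropic y, each y_i ~ N(0,1)
    return np.linalg.solve(L.T, y) + mu            # x = L^-T y + mu

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = np.array([sample_mvn(mu, Sigma, rng) for _ in range(50_000)])
print(X.mean(axis=0))     # ~ mu
print(np.cov(X.T))        # ~ Sigma
```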

  24. Monte Carlo With Likelihood Weighting • Monte Carlo with rejection has probability 0 of hitting a continuous value given as evidence, so likelihood weighting must be used • Example network X → Y with P(X) = N(mX,SX), P(Y|x) = N(Ax + mY, SY), and evidence Y = y • Step 1: Sample x ~ N(mX,SX) • Step 2: Weight the sample by P(y|x)
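
A minimal sketch of likelihood weighting for this two-node network (NumPy, scalar case, parameter values chosen arbitrarily): sample x from its prior and weight each sample by the Gaussian likelihood of the observed y.

```python
import numpy as np

def normal_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Network: X ~ N(mu_X, s_X^2), Y | x ~ N(a*x + mu_Y, s_Y^2); evidence Y = y_obs
mu_X, s_X = 0.0, 1.0
a, mu_Y, s_Y = 2.0, 1.0, 0.5
y_obs = 3.0

rng = np.random.default_rng(0)
x = rng.normal(mu_X, s_X, size=100_000)      # Step 1: sample x from the prior
w = normal_pdf(y_obs, a * x + mu_Y, s_Y)     # Step 2: weight by P(y_obs | x)

print(np.sum(w * x) / np.sum(w))             # weighted estimate of E[X | Y=y_obs]
```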

  25. Hybrid Networks • Hybrid networks combine both discrete and continuous variables • Exact inference techniques are hard to apply • Exact posteriors are mixtures of Gaussians • Inference is NP-hard even in polytree networks • Monte Carlo techniques apply in a straightforward way • Belief approximation can be applied (e.g., collapsing Gaussian mixtures to single Gaussians)

  26. Issues • Non-Gaussian distributions • Nonlinear dependencies • More in future lectures on particle filtering
