CS b553: Algorithms for Optimization and Learning Monte Carlo Methods for Probabilistic Inference
Agenda • Monte Carlo methods • O(1/sqrt(N)) standard deviation • For Bayesian inference • Likelihood weighting • Gibbs sampling
Monte Carlo Integration • Estimate large integrals/sums: • I = ∫ f(x) p(x) dx • I = Σx f(x) p(x) • Using N i.i.d. samples x(1),…,x(N) drawn from p(x): • I ≈ 1/N Σi f(x(i)) • Examples: • ∫[a,b] f(x) dx ≈ (b-a)/N Σi f(x(i)), with each x(i) drawn uniformly from [a,b] • E[X] = ∫ x p(x) dx ≈ 1/N Σi x(i) • Volume of a set in Rn
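As a concrete illustration, here is a minimal Python sketch of the estimator I ≈ 1/N Σi f(x(i)). The choice of p (a standard normal) and f(x) = x² is an assumption for the example, not taken from the slides; the exact value of the integral is 1.

```python
import numpy as np

# Hypothetical example: p = standard normal, f(x) = x^2, so I = E[f(x)] = 1.
def mc_estimate(f, sampler, n):
    """Average f over n i.i.d. samples drawn from p."""
    xs = sampler(n)
    return np.mean(f(xs))

rng = np.random.default_rng(0)
print(mc_estimate(lambda x: x**2, rng.standard_normal, 100_000))  # close to 1.0
```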
Mean & Variance of estimate • Let IN be the random variable denoting the estimate of the integral with N samples • What is the bias (mean error) E[I-IN]? • E[I-IN] = I - E[IN] (linearity of expectation) = E[f(x)] - 1/N Σi E[f(x(i))] (definition of I and IN) = 1/N Σi (E[f(x)] - E[f(x(i))]) = 1/N Σi 0 (each x(i) is distributed according to p(x), just like x) = 0
Mean & Variance of estimate • Since E[I-IN] = 0, IN is an unbiased estimator • What is the variance Var[IN]? • Var[IN] = Var[1/N Σi f(x(i))] (definition) = 1/N² Var[Σi f(x(i))] (scaling of variance) = 1/N² Σi Var[f(x(i))] (variance of a sum of independent variables) = 1/N Var[f(x)] (i.i.d. sample) • Standard deviation of IN: sqrt(Var[f(x)]/N), i.e., O(1/sqrt(N))
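A quick empirical check of this O(1/sqrt(N)) behavior, as a sketch reusing the assumed f(x) = x² and standard normal p from the earlier example, where Var[f(x)] = 2:

```python
import numpy as np

# Empirical check of the O(1/sqrt(N)) standard deviation of I_N.
# With f(x) = x^2 and p = standard normal, Var[f(x)] = 2, so the
# predicted standard deviation of I_N is sqrt(2/N).
rng = np.random.default_rng(0)
for n in [100, 400, 1600, 6400]:
    estimates = [np.mean(rng.standard_normal(n) ** 2) for _ in range(1000)]
    print(n, np.std(estimates), np.sqrt(2.0 / n))
```

Quadrupling N should roughly halve the observed standard deviation, matching the prediction.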
Approximate Inference Through Sampling • Unconditional simulation: to estimate the probability that a coin flips heads, flip it a huge number of times and count the fraction of heads observed • Conditional simulation: to estimate the probability P(H) that a coin picked out of a bucket B flips heads: • Repeat for i=1,…,N: pick a coin C out of a random bucket b(i) chosen with probability P(B); set h(i) = flip of C according to probability P(H|b(i)) • Each sample (h(i), b(i)) comes from the distribution P(H,B) • The empirical distribution of the samples approximates P(H,B)
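A small sketch of this coin/bucket conditional simulation; the bucket probabilities and coin biases below are made-up illustrative numbers, not values from the lecture:

```python
import numpy as np

# Made-up numbers: bucket 0 (prob 0.7) holds fair coins, bucket 1 (prob 0.3)
# holds coins that land heads 90% of the time.
rng = np.random.default_rng(1)
P_B = np.array([0.7, 0.3])           # P(B)
P_H_given_B = np.array([0.5, 0.9])   # P(H=1 | B)

N = 100_000
b = rng.choice(2, size=N, p=P_B)      # pick a bucket b(i) ~ P(B)
h = rng.random(N) < P_H_given_B[b]    # flip the coin ~ P(H | b(i))
print(h.mean())   # approximates P(H=1) = 0.7*0.5 + 0.3*0.9 = 0.62
```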
Monte Carlo Inference in Bayes Nets • BN over variables X • Repeat for i=1,…,N: in top-down (topological) order, generate x(i) as follows: sample xj(i) ~ P(Xj | paXj(i)) (the right-hand side is obtained by plugging the already-sampled parent values into the CPT for Xj) • The samples x(1),…,x(N) approximate the joint distribution over X
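Below is a sketch of this top-down (ancestral) sampling procedure for the burglary network used in the following slides. The CPT numbers are the usual textbook values and are an assumption here, not taken from the slides:

```python
import random

# Sketch of top-down (ancestral) sampling for the burglary network.
# CPT values are the usual textbook numbers, assumed here for illustration.
def sample_joint(rng=random):
    b = int(rng.random() < 0.001)                      # P(B=1)
    e = int(rng.random() < 0.002)                      # P(E=1)
    p_a = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(b, e)]
    a = int(rng.random() < p_a)                        # P(A=1 | B, E)
    j = int(rng.random() < (0.90 if a else 0.05))      # P(J=1 | A)
    m = int(rng.random() < (0.70 if a else 0.01))      # P(M=1 | A)
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

# The empirical distribution of many such samples approximates the joint:
samples = [sample_joint() for _ in range(100_000)]
print(sum(s["A"] for s in samples) / len(samples))     # approximates P(A=1)
```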
Approximate Inference: Monte Carlo Simulation • [Figure: burglary network with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls] • Sample from the joint distribution: B=0, E=0, A=0, J=1, M=0
Approximate Inference: Monte Carlo Simulation • As more samples are generated, the distribution of the samples approaches the joint distribution • Samples so far: (B=0, E=0, A=0, J=1, M=0), (B=0, E=0, A=0, J=0, M=0), (B=0, E=0, A=0, J=0, M=0), (B=1, E=0, A=1, J=1, M=0)
Basic Method for Handling Evidence • Inference: given evidence E=e (e.g., J=1), approximate P(X/E | E=e) • Remove the samples that conflict with the evidence • With evidence J=1: keep (B=0, E=0, A=0, J=1, M=0) and (B=1, E=0, A=1, J=1, M=0), reject (B=0, E=0, A=0, J=0, M=0) and (B=0, E=0, A=0, J=0, M=0) • The distribution of the remaining samples approximates the conditional distribution
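A sketch of this rejection-style evidence handling, reusing the hypothetical sample_joint() from the earlier sketch:

```python
# Rejection-style evidence handling: keep only samples that agree with the
# evidence, then read the conditional distribution off the survivors.
# Reuses the hypothetical sample_joint() from the sketch above.
def estimate_given_evidence(query_var, evidence, n=100_000):
    kept = [s for s in (sample_joint() for _ in range(n))
            if all(s[var] == val for var, val in evidence.items())]
    if not kept:
        return None   # every sample conflicted with the (rare) evidence
    return sum(s[query_var] for s in kept) / len(kept)

# Example: estimate P(B=1 | J=1); most samples are discarded because J=1 is uncommon.
print(estimate_given_evidence("B", {"J": 1}))
```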
Rare Event Problem • What if some events are really rare (e.g., burglary & earthquake)? • The number of samples must be huge to get a reasonable estimate • Solution: likelihood weighting • Enforce that each sample agrees with the evidence • While generating a sample, keep track of the ratio w = (how likely the sampled value is to occur in the real world) / (how likely you were to generate the sampled value)
Likelihood weighting • Suppose evidence Alarm & MaryCalls (A=1, M=1) • Sample B, E with P=0.5 each; start with w=1
Likelihood weighting • Sample 1: B=0, E=1 → w=0.008 • A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs → w=0.0023 • M=1 is enforced, J=1 is sampled → w=0.0016
Likelihood weighting • Sample 2: B=0, E=0 → w=3.988 • A=1 enforced → w=0.004 • M=1 enforced, J=1 sampled → w=0.0028
Likelihood weighting • Sample 3: B=1, E=0 with A=1 enforced → w=0.00375 • M=1 enforced, J=1 sampled → w=0.0026
Likelihood weighting • Sample 4: B=1, E=1, A=1, M=1, J=1 → w=5e-7
Likelihood weighting • Weighted samples: (B=0, E=1, A=1, M=1, J=1, w=0.0016), (B=0, E=0, A=1, M=1, J=1, w=0.0028), (B=1, E=0, A=1, M=1, J=1, w=0.0026), (B=1, E=1, A=1, M=1, J=1, w≈0) • N=4 samples give P(B|A,M) ≈ 0.371 • Exact inference gives P(B|A,M) = 0.375
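A sketch of the likelihood-weighting procedure from this example: evidence A=1, M=1, with B and E drawn from a 0.5/0.5 proposal as on the slides. The CPT numbers are again the assumed textbook values:

```python
import random

# Likelihood weighting for evidence A=1, M=1, with B and E drawn from a
# 0.5/0.5 proposal as on the slides. Assumed textbook CPT values.
P_B1, P_E1 = 0.001, 0.002
P_A1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_M1_GIVEN_A1 = 0.70

def weighted_sample(rng=random):
    b = int(rng.random() < 0.5)
    e = int(rng.random() < 0.5)
    # ratio of real-world probability to sampling probability for B and E
    w = ((P_B1 if b else 1 - P_B1) / 0.5) * ((P_E1 if e else 1 - P_E1) / 0.5)
    w *= P_A1[(b, e)]          # A=1 is enforced; weight by its likelihood
    w *= P_M1_GIVEN_A1         # M=1 is enforced; weight by its likelihood
    return b, w                # J would be sampled freely and does not change w

def estimate_B_given_AM(n=100_000):
    num = den = 0.0
    for _ in range(n):
        b, w = weighted_sample()
        num += w * b
        den += w
    return num / den

print(estimate_B_given_AM())   # should approach the exact 0.375
```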
Another Rare-Event Problem • [Figure: network with variables A1, A2, …, A10 and children B1, B2, …, B10] • B=b is given as evidence • Each bi is rare given all but one setting of Ai (say, Ai=1) • The chance of sampling all 1's is very low, so most likelihood weights will be tiny • Problem: the evidence is not being used to sample the Ai's effectively (i.e., near P(Ai|b))
Gibbs Sampling • Idea: reduce the computational burden of sampling from a multidimensional distribution P(x)=P(x1,…,xn) by repeatedly drawing individual variables • Cycle through j=1,…,n: sample xj ~ P(Xj | x1,…,xj-1,xj+1,…,xn) • Over the long run, the random walk taken by x approaches the true distribution P(x)
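A minimal sketch of the idea on a toy two-dimensional target. The bivariate normal target and its correlation are illustrative assumptions, chosen only because its full conditionals are available in closed form:

```python
import numpy as np

# Toy Gibbs sampler for a standard bivariate normal with correlation rho
# (an illustrative target, not from the slides). Its full conditionals are:
#   X1 | X2 = x2  ~  N(rho * x2, 1 - rho^2), and symmetrically for X2 | X1.
rng = np.random.default_rng(0)
rho = 0.8
x1 = x2 = 0.0                           # arbitrary initialization
samples = []
for t in range(50_000):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples.append((x1, x2))
samples = np.array(samples[1_000:])     # discard burn-in
print(np.corrcoef(samples.T)[0, 1])     # should be close to rho = 0.8
```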
Gibbs Sampling in BNs • Each Gibbs sampling step: 1) pick a variable Xi, 2) sample xi ~ P(Xi | X/Xi) • Look at the values of the "Markov blanket" of Xi: its parents PaXi, its children Y1,…,Yk, and the parents of its children (excluding Xi) PaY1/Xi, …, PaYk/Xi • Xi is independent of the rest of the network given its Markov blanket • So sample xi ~ P(Xi | PaXi, Y1, PaY1/Xi, …, Yk, PaYk/Xi) = 1/Z P(Xi|PaXi) P(Y1|PaY1) ⋅ … ⋅ P(Yk|PaYk) • This is the product of Xi's own factor and the factors of its children
Handling evidence • Simply set each evidence variable to its observed value and do not resample it • The resulting walk approximates the distribution P(X/E | E=e) • Uses evidence more efficiently than likelihood weighting
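A sketch of Gibbs sampling with the evidence clamped, for the query P(B | A=1, M=1) in the burglary network. The CPT values are the assumed textbook numbers, and J is omitted because it does not affect this query; each update uses only the variable's Markov blanket:

```python
import random

# Gibbs sampling for P(B=1 | A=1, M=1) in the burglary network, with the
# evidence variables clamped. Assumed textbook CPT values. Only the
# Markov blanket matters: P(B | rest) is proportional to P(B) * P(A=1 | B, E),
# and similarly for E. J is ignored since it does not affect the query.
P_B1, P_E1 = 0.001, 0.002
P_A1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)

def resample_B(e, rng=random):
    p1 = P_B1 * P_A1[(1, e)]           # unnormalized P(B=1 | E=e, A=1)
    p0 = (1 - P_B1) * P_A1[(0, e)]     # unnormalized P(B=0 | E=e, A=1)
    return int(rng.random() < p1 / (p1 + p0))

def resample_E(b, rng=random):
    p1 = P_E1 * P_A1[(b, 1)]
    p0 = (1 - P_E1) * P_A1[(b, 0)]
    return int(rng.random() < p1 / (p1 + p0))

def gibbs_B_given_AM(steps=200_000, burn_in=1_000):
    b = e = 0                          # arbitrary initialization of non-evidence variables
    count = total = 0
    for t in range(steps):
        b = resample_B(e)
        e = resample_E(b)
        if t >= burn_in:
            count += b
            total += 1
    return count / total

print(gibbs_B_given_AM())              # should approach the exact 0.375
```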
Gibbs sampling issues • Demonstrating correctness and convergence requires analyzing the Markov chain random walk (more later) • Many steps may be needed before the effects of a poor initialization wear off (the mixing time) • It is difficult to tell a priori how many steps are needed • Numerous variants exist • These methods are known as Markov chain Monte Carlo (MCMC) techniques
Next time • Continuous and hybrid distributions