CMSC 671 Fall 2010

CMSC 671 Fall 2010 Class #18/19 – Wednesday, November 3 / Monday, November 8 Some material borrowed with permission from Lise Getoor

Next two classes • Probability theory (quick review!) • Bayesian networks • Network structure • Conditional probability tables • Conditional independence • Bayesian inference • From the joint distribution • Using independence/factoring • From sources of evidence

Bayesian Reasoning Chapter 13

Sources of uncertainty • Uncertain inputs • Missing data • Noisy data • Uncertain knowledge • Multiple causes lead to multiple effects • Incomplete enumeration of conditions or effects • Incomplete knowledge of causality in the domain • Probabilistic/stochastic effects • Uncertain outputs • Abduction and induction are inherently uncertain • Default reasoning, even in deductive fashion, is uncertain • Incomplete deductive inference may be uncertain • Probabilistic reasoning only gives probabilistic results (summarizes uncertainty from various sources)

Decision making with uncertainty • Rational behavior: • For each possible action, identify the possible outcomes • Compute the probability of each outcome • Compute the utility of each outcome • Compute the probability-weighted (expected) utility over possible outcomes for each action • Select the action with the highest expected utility (principle of Maximum Expected Utility)
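The five steps above can be sketched in a few lines of Python. This is only an illustration: the action names, probabilities, and utilities are made-up values, not part of the course material.

```python
# Hypothetical actions; each maps to (probability, utility) pairs,
# one pair per possible outcome of that action.
ACTIONS = {
    "take_umbrella": [(0.3, 70.0), (0.7, 80.0)],
    "leave_umbrella": [(0.3, 0.0), (0.7, 100.0)],
}


def expected_utility(outcomes):
    """Probability-weighted sum of utilities over an action's outcomes."""
    return sum(p * u for p, u in outcomes)


def best_action(actions):
    """Return the action with the highest expected utility (MEU principle)."""
    return max(actions, key=lambda name: expected_utility(actions[name]))


if __name__ == "__main__":
    for name, outcomes in ACTIONS.items():
        print(name, expected_utility(outcomes))
    print("choose:", best_action(ACTIONS))
```

With these assumed numbers the safer action wins even though the riskier one has the single best outcome.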

Why probabilities anyway? • Kolmogorov showed that three simple axioms lead to the rules of probability theory • De Finetti, Cox, and Carnap have also provided compelling arguments for these axioms • All probabilities are between 0 and 1: • 0 ≤ P(a) ≤ 1 • Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0: • P(true) = 1 ; P(false) = 0 • The probability of a disjunction is given by: • P(a ∨ b) = P(a) + P(b) - P(a ∧ b) [Figure: Venn diagram of a and b, overlapping in a ∧ b]

Probability theory • Random variables: Alarm, Burglary, Earthquake • Domain: Boolean (like these), discrete, continuous • Atomic event (complete specification of state): Alarm=True ∧ Burglary=True ∧ Earthquake=False, also written alarm ∧ burglary ∧ ¬earthquake • Prior probability (degree of belief without any other evidence): P(Burglary) = .1 • Joint probability (matrix of combined probabilities of a set of variables): P(Alarm, Burglary) = { alarm ∧ burglary: .09, alarm ∧ ¬burglary: .1, ¬alarm ∧ burglary: .01, ¬alarm ∧ ¬burglary: .8 }

Probability theory (cont.) • Conditional probability (probability of effect given causes): P(burglary | alarm) = .47, P(alarm | burglary) = .9 • Computing conditional probs: P(a | b) = P(a ∧ b) / P(b), where P(b) is the normalizing constant; e.g., P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm) = .09 / .19 = .47 • Product rule: P(a ∧ b) = P(a | b) P(b); e.g., P(burglary ∧ alarm) = P(burglary | alarm) P(alarm) = .47 * .19 = .09 • Marginalizing: P(B) = Σa P(B, a); P(B) = Σa P(B | a) P(a) (conditioning); e.g., P(alarm) = P(alarm ∧ burglary) + P(alarm ∧ ¬burglary) = .09 + .1 = .19

Example: Inference from the joint • P(Burglary | alarm) = α P(Burglary, alarm) = α [P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake)] = α [(.01, .01) + (.08, .09)] = α [(.09, .1)] • Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.09 + .1) = 5.26 (i.e., P(alarm) = 1/α = .19; quizlet: how can you verify this?) • P(burglary | alarm) = .09 * 5.26 = .474 • P(¬burglary | alarm) = .1 * 5.26 = .526
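The normalization step can be checked numerically. The sketch below uses only the four alarm=True joint entries quoted on this slide; the dictionary layout and function name are illustrative choices.

```python
# Joint entries with alarm=True, keyed by (burglary, earthquake),
# taken from the slide: (.01, .01) for earthquake, (.08, .09) for no earthquake.
JOINT_ALARM = {
    (True, True): 0.01,
    (False, True): 0.01,
    (True, False): 0.08,
    (False, False): 0.09,
}


def posterior_burglary(joint_alarm):
    """Return (P(burglary | alarm), P(not burglary | alarm), alpha)."""
    # Sum out Earthquake to get the unnormalized vector P(Burglary, alarm).
    unnorm = {
        b: sum(p for (bb, _), p in joint_alarm.items() if bb == b)
        for b in (True, False)
    }
    # alpha is 1 / P(alarm), chosen so the two posteriors sum to 1.
    alpha = 1.0 / sum(unnorm.values())
    return alpha * unnorm[True], alpha * unnorm[False], alpha


if __name__ == "__main__":
    p_b, p_not_b, alpha = posterior_burglary(JOINT_ALARM)
    print(round(p_b, 3), round(p_not_b, 3), round(alpha, 2))
```

Summing the unnormalized entries is also one way to answer the quizlet: the total is P(alarm).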

Exercise: Inference from the joint • Queries: • What is the prior probability of smart? • What is the prior probability of study? • What is the conditional probability of prepared, given study and smart? • Save these answers for next time! 

Independence • When two sets of propositions do not affect each other’s probabilities, we call them independent, and can easily compute their joint and conditional probability: • Independent (A, B) → P(A ∧ B) = P(A) P(B), P(A | B) = P(A) • For example, {moon-phase, light-level} might be independent of {burglary, alarm, earthquake} • Then again, it might not: Burglars might be more likely to burglarize houses when there’s a new moon (and hence little light) • But if we know the light level, the moon phase doesn’t affect whether we are burglarized • Once we’re burglarized, light level doesn’t affect whether the alarm goes off • We need a more complex notion of independence, and methods for reasoning about these kinds of relationships

Exercise: Independence • Queries: • Is smart independent of study? • Is prepared independent of study?

Conditional independence • Absolute independence: • A and B are independent if P(A ∧ B) = P(A) P(B); equivalently, P(A) = P(A | B) and P(B) = P(B | A) • A and B are conditionally independent given C if • P(A ∧ B | C) = P(A | C) P(B | C) • This lets us decompose the joint distribution: • P(A ∧ B ∧ C) = P(A | C) P(B | C) P(C) • Moon-Phase and Burglary are conditionally independent given Light-Level • Conditional independence is weaker than absolute independence, but still useful in decomposing the full joint probability distribution
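A small numeric check can make the two definitions concrete. The sketch below builds a joint over Boolean A, B, C from the decomposition P(A | C) P(B | C) P(C), using made-up parameters, and then tests both kinds of independence directly from the joint.

```python
from itertools import product

# Hypothetical parameters: C influences both A and B.
P_C = 0.3
P_A_GIVEN_C = {True: 0.9, False: 0.2}
P_B_GIVEN_C = {True: 0.8, False: 0.1}


def bern(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


# Joint built as P(a | c) P(b | c) P(c), keyed by (a, b, c).
JOINT = {
    (a, b, c): bern(P_A_GIVEN_C[c], a) * bern(P_B_GIVEN_C[c], b) * bern(P_C, c)
    for a, b, c in product((True, False), repeat=3)
}
NAMES = ("a", "b", "c")


def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(a=True, c=False)."""
    return sum(
        p for key, p in JOINT.items()
        if all(key[NAMES.index(n)] == v for n, v in fixed.items())
    )


def cond_indep_given_c(tol=1e-12):
    """Check P(a, b | c) = P(a | c) P(b | c) for both values of C.

    For Boolean variables, checking the a=True, b=True cell is sufficient.
    """
    for c in (True, False):
        pc = prob(c=c)
        lhs = prob(a=True, b=True, c=c) / pc
        rhs = (prob(a=True, c=c) / pc) * (prob(b=True, c=c) / pc)
        if abs(lhs - rhs) > tol:
            return False
    return True


def abs_indep(tol=1e-12):
    """Check P(a, b) = P(a) P(b)."""
    return abs(prob(a=True, b=True) - prob(a=True) * prob(b=True)) <= tol


if __name__ == "__main__":
    print("conditionally independent given C:", cond_indep_given_c())
    print("absolutely independent:", abs_indep())
```

In this assumed example A and B are conditionally independent given C but not absolutely independent, which mirrors the Moon-Phase / Burglary / Light-Level situation.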

Exercise: Conditional independence • Queries: • Is smart conditionally independent of prepared, given study? • Is study conditionally independent of prepared, given smart?

Bayes’s rule • Bayes’s rule is derived from the product rule: • P(Y | X) = P(X | Y) P(Y) / P(X) • Often useful for diagnosis: • If X are (observed) effects and Y are (hidden) causes, • We may have a model for how causes lead to effects (P(X | Y)) • We may also have prior beliefs (based on experience) about the frequency of occurrence of causes (P(Y)) • Which allows us to reason abductively from effects to causes (P(Y | X)).
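As a quick check, Bayes’s rule can be applied to the burglary numbers quoted earlier in these slides: P(alarm | burglary) = .9, P(burglary) = .1, P(alarm) = .19. The function name below is an illustrative choice.

```python
def bayes(p_x_given_y, p_y, p_x):
    """Return P(Y | X) = P(X | Y) P(Y) / P(X)."""
    if p_x <= 0:
        raise ValueError("P(X) must be positive")
    return p_x_given_y * p_y / p_x


if __name__ == "__main__":
    # X = alarm (observed effect), Y = burglary (hidden cause).
    print(round(bayes(0.9, 0.1, 0.19), 2))
```

The result agrees with the value P(burglary | alarm) = .47 obtained from the joint distribution.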

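Bayesian inference • In the setting of diagnostic/evidential reasoning • Know the prior probability of each hypothesis, P(Hi), and the conditional probability of the evidence given the hypothesis, P(Ej | Hi) • Want to compute the posterior probability P(Hi | Ej) • Bayes’ theorem (formula 1): P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)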

Simple Bayesian diagnostic reasoning • Knowledge base: • Evidence / manifestations: E1, …, Em • Hypotheses / disorders: H1, …, Hn • Ej and Hi are binary; hypotheses are mutually exclusive (non-overlapping) and exhaustive (cover all possible cases) • Conditional probabilities: P(Ej | Hi), i = 1, …, n; j = 1, …, m • Cases (evidence for a particular instance): E1, …, El • Goal: Find the hypothesis Hi with the highest posterior • max_i P(Hi | E1, …, El)

Bayesian diagnostic reasoning II • Bayes’ rule says that • P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El) • Assume each piece of evidence Ej is conditionally independent of the others, given a hypothesis Hi, then: • P(E1, …, El | Hi) = Π_{j=1..l} P(Ej | Hi) • If we only care about relative probabilities for the Hi, then we have: • P(Hi | E1, …, El) = α P(Hi) Π_{j=1..l} P(Ej | Hi)
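The last formula translates almost directly into code. The following is a minimal sketch under the slide’s assumptions (binary evidence, mutually exclusive and exhaustive hypotheses, evidence conditionally independent given the hypothesis). The hypothesis names, evidence names, and all numbers are invented for illustration.

```python
from math import prod

# Hypothetical priors P(Hi); they sum to 1 because the Hi are exhaustive.
PRIORS = {"flu": 0.1, "cold": 0.3, "healthy": 0.6}

# Hypothetical conditionals P(Ej = true | Hi).
LIKELIHOODS = {
    "flu": {"fever": 0.9, "cough": 0.8},
    "cold": {"fever": 0.2, "cough": 0.7},
    "healthy": {"fever": 0.01, "cough": 0.05},
}


def posterior(observed):
    """Return normalized P(Hi | evidence); observed maps evidence name -> bool."""
    scores = {}
    for h, prior in PRIORS.items():
        # Product over observed evidence of P(Ej | Hi), or 1 - P(Ej | Hi) if false.
        like = prod(
            LIKELIHOODS[h][e] if v else 1.0 - LIKELIHOODS[h][e]
            for e, v in observed.items()
        )
        scores[h] = prior * like
    alpha = 1.0 / sum(scores.values())
    return {h: alpha * s for h, s in scores.items()}


def diagnose(observed):
    """Return the hypothesis with the highest posterior."""
    post = posterior(observed)
    return max(post, key=post.get)


if __name__ == "__main__":
    print(posterior({"fever": True, "cough": True}))
    print(diagnose({"fever": True, "cough": True}))
```

Because only the ranking matters for diagnosis, the normalization by α could be skipped; it is kept here so the output reads as probabilities.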

Limitations of simple Bayesian inference • Cannot easily handle multi-fault situations, nor cases where intermediate (hidden) causes exist: • Disease D causes syndrome S, which causes correlated manifestations M1 and M2 • Consider a composite hypothesis H1 ∧ H2, where H1 and H2 are independent. What is the relative posterior? • P(H1 ∧ H2 | E1, …, El) = α P(E1, …, El | H1 ∧ H2) P(H1 ∧ H2) = α P(E1, …, El | H1 ∧ H2) P(H1) P(H2) = α Π_{j=1..l} P(Ej | H1 ∧ H2) P(H1) P(H2) • How do we compute P(Ej | H1 ∧ H2)?

Limitations of simple Bayesian inference II • Assume H1 and H2 are independent, given E1, …, El? • P(H1 ∧ H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El) • This is a very unreasonable assumption • Earthquake and Burglar are independent, but not given Alarm: • P(burglar | alarm, earthquake) << P(burglar | alarm) • Another limitation is that simple application of Bayes’s rule doesn’t allow us to handle causal chaining: • A: this year’s weather; B: cotton production; C: next year’s cotton price • A influences C indirectly: A → B → C • P(C | B, A) = P(C | B) • Need a richer representation to model interacting hypotheses, conditional independence, and causal chaining • Next time: conditional independence and Bayesian networks!

Bayesian Networks Chapter 14.1-14.3 Some material borrowed from Lise Getoor

Bayesian Belief Networks (BNs) • Definition: BN = (DAG, CPD) • DAG: directed acyclic graph (BN’s structure) • Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables) • Arcs: indicate probabilistic dependencies between nodes (lack of link signifies conditional independence) • CPD: conditional probability distribution (BN’s parameters) • Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT) • Root nodes are a special case – no parents, so just use priors in CPD:

Example BN • Structure: a → b, a → c, b → d, c → d, c → e • P(A) = 0.001 • P(B|A) = 0.3, P(B|¬A) = 0.001 • P(C|A) = 0.2, P(C|¬A) = 0.005 • P(D|B,C) = 0.1, P(D|B,¬C) = 0.01, P(D|¬B,C) = 0.01, P(D|¬B,¬C) = 0.00001 • P(E|C) = 0.4, P(E|¬C) = 0.002 • Note that we only specify P(A) etc., not P(¬A), since they have to add to one

Conditional independence and chaining • Conditional independence assumption: P(xi | πi, q) = P(xi | πi), where πi is the set of parents of xi and q is any set of variables (nodes) other than xi and its successors • πi blocks influence of other nodes on xi and its successors (q influences xi only through variables in πi) • With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) local CPDs by chaining these CPDs: P(x1, …, xn) = Π_{i=1..n} P(xi | πi)

Chaining: Example • Computing the joint probability for all variables is easy: • P(a, b, c, d, e) = P(e | a, b, c, d) P(a, b, c, d) (by the product rule) • = P(e | c) P(a, b, c, d) (by the cond. indep. assumption) • = P(e | c) P(d | a, b, c) P(a, b, c) • = P(e | c) P(d | b, c) P(c | a, b) P(a, b) • = P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)
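The final factorization can be evaluated directly from the CPTs of the example BN. The sketch below assumes that, in each repeated CPT entry on the Example BN slide, the later values are the ones conditioned on negated parents (for instance, 0.005 is read as P(C | ¬A)); the variable and function names are illustrative.

```python
from itertools import product

# CPT entries from the Example BN slide. Reading the repeated entries as
# the negated-parent cases is an assumption, flagged in the text above.
P_A = 0.001
P_B = {True: 0.3, False: 0.001}    # P(b | A)
P_C = {True: 0.2, False: 0.005}    # P(c | A)
P_D = {                            # P(d | B, C), keyed by (B, C)
    (True, True): 0.1, (True, False): 0.01,
    (False, True): 0.01, (False, False): 0.00001,
}
P_E = {True: 0.4, False: 0.002}    # P(e | C)


def bern(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """P(a, b, c, d, e) = P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)."""
    return (bern(P_E[c], e) * bern(P_D[(b, c)], d) * bern(P_C[a], c)
            * bern(P_B[a], b) * bern(P_A, a))


if __name__ == "__main__":
    total = sum(joint(*vals) for vals in product((True, False), repeat=5))
    print("all true:", joint(True, True, True, True, True))
    print("sum over all 32 atomic events:", total)
```

Five CPTs with 11 numbers recover all 32 entries of the full joint, and the entries sum to 1 as they must.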

Topological semantics • A node is conditionally independent of its non-descendants given its parents • A node is conditionally independent of all other nodes in the network given its parents, children, and children’s parents (also known as its Markov blanket) • The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z
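The Markov blanket definition can be computed from the graph alone. This sketch encodes the five-node example network (a → b, a → c, b → d, c → d, c → e, as implied by its CPTs) as a parent map; the representation and function name are illustrative choices.

```python
# Parent lists for the example BN: a -> b, a -> c, b -> d, c -> d, c -> e.
PARENTS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["c"]}


def markov_blanket(node, parents):
    """Return the node's parents, children, and children's other parents."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)  # a node is not part of its own blanket
    return blanket


if __name__ == "__main__":
    for n in PARENTS:
        print(n, sorted(markov_blanket(n, PARENTS)))
```

For node b, for example, the blanket contains c even though b and c are not directly linked, because they share the child d.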

Inference in Bayesian Networks Chapter 14.4-14.5 Some material borrowed from Lise Getoor

Inference tasks • Simple queries: Compute posterior marginal P(Xi | E=e) • E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false) • Conjunctive queries: • P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e) • Optimal decisions: Decision networks include utility information; probabilistic inference is required to find P(outcome | action, evidence) • Value of information: Which evidence should we seek next? • Sensitivity analysis: Which probability values are most critical? • Explanation: Why do I need a new starter motor?

Approaches to inference • Exact inference • Enumeration • Belief propagation in polytrees • Variable elimination • Clustering / join tree algorithms • Approximate inference • Stochastic simulation / sampling methods • Markov chain Monte Carlo methods • Genetic algorithms • Neural networks • Simulated annealing • Mean field theory

Direct inference with BNs • Instead of computing the joint, suppose we just want the probability for one variable • Exact methods of computation: • Enumeration • Variable elimination • Join trees: get the probabilities associated with every query variable

Inference by enumeration • Add all of the terms (atomic event probabilities) from the full joint distribution • If E are the evidence (observed) variables and Y are the other (unobserved) variables, then: P(X | e) = α P(X, e) = α Σy P(X, e, y) • Each P(X, e, y) term can be computed using the chain rule • Computationally expensive!

Example: Enumeration • P(xi) = Σπi P(xi | πi) P(πi) • Suppose we want P(D=true), and only the value of E is given as true • P(d | e) = α Σ_{a,b,c} P(a, b, c, d, e) = α Σ_{a,b,c} P(a) P(b|a) P(c|a) P(d|b,c) P(e|c) • With simple iteration to compute this expression, there’s going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C=true)
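The enumeration query can be written as a direct loop over the hidden variables. As before, this sketch assumes the repeated CPT entries of the example BN are the negated-parent cases; it is the naive version, with no caching of repeated subexpressions.

```python
from itertools import product

# CPTs of the example BN (negated-parent reading is an assumption).
P_A = 0.001
P_B = {True: 0.3, False: 0.001}    # P(b | A)
P_C = {True: 0.2, False: 0.005}    # P(c | A)
P_D = {                            # P(d | B, C), keyed by (B, C)
    (True, True): 0.1, (True, False): 0.01,
    (False, True): 0.01, (False, False): 0.00001,
}
P_E = {True: 0.4, False: 0.002}    # P(e | C)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """Chain-rule joint: P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(b, c)], d) * bern(P_E[c], e))


def query_d_given_e():
    """Return the distribution P(D | E=true) by summing out A, B, C."""
    unnorm = {
        d: sum(joint(a, b, c, d, True)
               for a, b, c in product((True, False), repeat=3))
        for d in (True, False)
    }
    alpha = 1.0 / sum(unnorm.values())  # 1 / P(e)
    return {d: alpha * p for d, p in unnorm.items()}


if __name__ == "__main__":
    print(query_d_given_e())
```

Each call to joint re-multiplies the same CPT entries many times, which is exactly the repetition that variable elimination avoids.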

Exercise: Enumeration • Network nodes: smart, study, prepared, fair, pass • p(smart) = .8 • p(study) = .6 • p(fair) = .9 • Query: What is the probability that a student studied, given that they pass the exam?

Variable elimination • Basically just enumeration, but with caching of local calculations • Linear for polytrees (singly connected BNs) • Potentially exponential for multiply connected BNs • Exact inference in Bayesian networks is NP-hard! • Join tree algorithms are an extension of variable elimination methods that compute posterior probabilities for all nodes in a BN simultaneously

Variable elimination General idea: • Write the query in the form P(X, e) = Σ_{xk} … Σ_{x3} Σ_{x2} Π_i P(xi | πi) • Iteratively • Move all irrelevant terms outside of innermost sum • Perform innermost sum, getting a new term • Insert the new term into the product
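The iterative procedure can be sketched with factors stored as tables. This is a minimal illustration for Boolean variables only, not an optimized implementation; the factor representation (a tuple of variable names plus a dict) and the three-node chain a → b → c with its numbers are assumptions made for the example.

```python
from itertools import product


def make_factor(variables, fn):
    """Tabulate fn over all True/False assignments to `variables`."""
    variables = tuple(variables)
    table = {
        vals: fn(dict(zip(variables, vals)))
        for vals in product((True, False), repeat=len(variables))
    }
    return variables, table


def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    variables = fv + tuple(v for v in gv if v not in fv)
    return make_factor(
        variables,
        lambda s: ft[tuple(s[v] for v in fv)] * gt[tuple(s[v] for v in gv)],
    )


def sum_out(var, f):
    """Perform the innermost sum: add up the factor over both values of var."""
    fv, ft = f
    rest = tuple(v for v in fv if v != var)
    return make_factor(
        rest,
        lambda s: sum(ft[tuple({**s, var: x}[v] for v in fv)]
                      for x in (True, False)),
    )


def eliminate(factors, order):
    """Eliminate variables in `order`; return the product of what remains."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]  # irrelevant terms
        if not touching:
            continue
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f)
        factors.append(sum_out(var, combined))  # insert the new term
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


# Hypothetical chain a -> b -> c.
F_A = make_factor(("a",), lambda s: bern(0.3, s["a"]))
F_B = make_factor(("a", "b"), lambda s: bern(0.9 if s["a"] else 0.2, s["b"]))
F_C = make_factor(("b", "c"), lambda s: bern(0.7 if s["b"] else 0.1, s["c"]))

if __name__ == "__main__":
    variables, table = eliminate([F_A, F_B, F_C], ["a", "b"])
    print(variables, table)
```

Eliminating a and then b leaves a single factor over c, which here is the marginal P(c). Intermediate factors are computed once and reused, which is the caching that distinguishes this from plain enumeration.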

Variable elimination: Example • Network nodes: Cloudy, Sprinkler, Rain, WetGrass

A more complex example • “Asia” network nodes: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Abnormality in Chest (A), Bronchitis (B), X-Ray (X), Dyspnea (D)

• We want to compute P(d) • Need to eliminate: v, s, x, t, l, a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

• We want to compute P(d) • Need to eliminate: v, s, x, t, l, a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Eliminate: v • Compute: fv(t) = Σv P(v) P(t|v) • Note: fv(t) = P(t) • In general, the result of elimination is not necessarily a probability term

• We want to compute P(d) • Need to eliminate: s, x, t, l, a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Eliminate: s • Compute: fs(b,l) = Σs P(s) P(b|s) P(l|s) • Summing on s results in a factor with two arguments, fs(b,l) • In general, the result of elimination may be a function of several variables

• We want to compute P(d) • Need to eliminate: x, t, l, a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Eliminate: x • Compute: fx(a) = Σx P(x|a) • Note: fx(a) = 1 for all values of a!

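• We want to compute P(d) • Need to eliminate: t, l, a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Eliminate: t • Compute: ft(a,l) = Σt fv(t) P(a|t,l), where fv(t) is the factor produced by eliminating v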

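• We want to compute P(d) • Need to eliminate: l, a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Eliminate: l • Compute: fl(a,b) = Σl fs(b,l) ft(a,l), where fs(b,l) and ft(a,l) are the factors produced by eliminating s and t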

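• We want to compute P(d) • Need to eliminate: a, b • Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Eliminate: a, b • Compute: fa(b,d) = Σa fl(a,b) fx(a) P(d|a,b), then fb(d) = Σb fa(b,d), where fl(a,b) and fx(a) are the factors produced by eliminating l and x • The remaining factor fb(d) is P(d)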

Dealing with evidence • How do we deal with evidence? • Suppose we are given evidence V = t, S = f, D = t • We want to compute P(L, V = t, S = f, D = t)

Dealing with evidence • We start by writing the factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) • Since we know that V = t, we don’t need to eliminate V • Instead, we can replace the factors P(V) and P(T|V) with fP(V) = P(V = t) and fP(T|V)(T) = P(T | V = t) • These “select” the appropriate parts of the original factors given the evidence • Note that fP(V) is a constant, and thus does not appear in elimination of other variables
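The “select” operation amounts to keeping only the rows of a factor that agree with the evidence and dropping the evidence variable. The sketch below shows this for P(V) and P(T|V) with V = t; the table layout, function name, and all probability values are placeholders chosen for illustration.

```python
def restrict(variables, table, var, value):
    """Keep only rows where var == value, and drop var from the factor."""
    idx = variables.index(var)
    new_vars = variables[:idx] + variables[idx + 1:]
    new_table = {
        key[:idx] + key[idx + 1:]: p
        for key, p in table.items()
        if key[idx] == value
    }
    return new_vars, new_table


# Hypothetical CPTs; the numbers are placeholders, not the lecture's values.
P_V = (("V",), {(True,): 0.01, (False,): 0.99})
P_T_GIVEN_V = (
    ("T", "V"),
    {(True, True): 0.05, (False, True): 0.95,
     (True, False): 0.01, (False, False): 0.99},
)

# Evidence V = t: P(V) becomes a constant, P(T|V) becomes a factor over T.
F_V = restrict(*P_V, "V", True)
F_T = restrict(*P_T_GIVEN_V, "V", True)

if __name__ == "__main__":
    print(F_V)
    print(F_T)
```

The restricted P(V) has no variables left, which is why it behaves as a constant during the remaining eliminations.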

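Dealing with evidence • Given evidence V = t, S = f, D = t • Compute P(L, V = t, S = f, D = t) • Initial factors, after setting evidence: fP(V) fP(S) fP(T|V)(t) fP(L|S)(l) fP(B|S)(b) P(a|t,l) P(x|a) fP(D|A,B)(a,b) • Eliminating x, we get fP(V) fP(S) fP(T|V)(t) fP(L|S)(l) fP(B|S)(b) P(a|t,l) fx(a) fP(D|A,B)(a,b) • Eliminating t, we get fP(V) fP(S) fP(L|S)(l) fP(B|S)(b) ft(a,l) fx(a) fP(D|A,B)(a,b) • Eliminating a, we get fP(V) fP(S) fP(L|S)(l) fP(B|S)(b) fa(b,l)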
