CS B553: Algorithms for Optimization and Learning. Bayesian Networks
Agenda • Bayesian networks • Chain rule for Bayes nets • Naïve Bayes models • Independence declarations • D-separation • Probabilistic inference queries
Purposes of Bayesian Networks • Efficient and intuitive modeling of complex causal interactions • Compact representation of joint distributions: O(n) rather than O(2^n) • Algorithms for efficient inference with given evidence (more on this next time)
Independence of random variables • Two random variables A and B are independent if P(A,B) = P(A)P(B), hence P(A|B) = P(A) • Knowing B doesn't give you any information about A • [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent]
Significance of independence • If A and B are independent, then P(A,B) = P(A)P(B) • => The joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B • => Store two much smaller probability tables rather than one large probability table over all combinations of A and B
Conditional Independence • Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C)P(B|C), hence P(A|B,C) = P(A|C) • Once you know C, learning B doesn't give you any information about A • [Again, this has to hold for all combinations of values that A, B, and C can take on]
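To make these two definitions concrete, here is a minimal Python sketch (the joint table and all of its numbers are purely illustrative, not from the slides) that checks both conditions numerically. The toy distribution P(A,B,C) is built so that A ⊥ B | C holds while A and B are not marginally independent:

```python
import itertools

# Illustrative joint distribution P(A,B,C) over binary variables,
# constructed as P(C) P(A|C) P(B|C) so that A and B are conditionally
# independent given C, but not marginally independent.
joint = {
    (0, 0, 0): 0.18, (0, 1, 0): 0.02, (1, 0, 0): 0.27, (1, 1, 0): 0.03,
    (0, 0, 1): 0.12, (0, 1, 1): 0.18, (1, 0, 1): 0.08, (1, 1, 1): 0.12,
}

def marginal(keep):
    """Sum out all variables whose index is not in `keep` (0=A, 1=B, 2=C)."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Independence: P(A,B) = P(A) P(B) for every combination of values.
p_ab, p_a, p_b = marginal([0, 1]), marginal([0]), marginal([1])
print(all(abs(p_ab[a, b] - p_a[a,] * p_b[b,]) < 1e-9
          for a, b in itertools.product((0, 1), repeat=2)))    # False

# Conditional independence: P(A,B|C) = P(A|C) P(B|C), i.e.
# P(A,B,C) P(C) = P(A,C) P(B,C), for every combination of values.
p_ac, p_bc, p_c = marginal([0, 2]), marginal([1, 2]), marginal([2])
print(all(abs(joint[a, b, c] * p_c[c,] - p_ac[a, c] * p_bc[b, c]) < 1e-9
          for a, b, c in itertools.product((0, 1), repeat=3)))  # True
```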
Significance of Conditional independence • Consider Grade(CS101), Intelligence, and SAT • Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores • But good students are more likely to get both good grades and good SAT scores, so the two are not independent… • It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence
Bayesian Network • Explicitly represents independence among propositions • Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly • P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I) • [Diagram: Intelligence → Grade, Intelligence → SAT] • 6 probabilities, instead of 11
Definition: Bayesian network • Set of random variables X = {X1,…,Xn} with domains Val(X1),…,Val(Xn) • Each node has a set of parents PaX • The graph must be a DAG • Each node also maintains a conditional probability distribution (often, a table) P(X|PaX) • 2^k entries for a binary-valued variable with k parents • Overall: O(n·2^k) storage for binary variables • Encodes the joint probability over X1,…,Xn
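To make the definition concrete, here is one way the burglary network used on the following slides might be stored: each node keeps its parent list and a CPT mapping parent assignments to P(X=1 | parents). This is just a sketch, and the CPT entries that don't appear on these slides (0.95, 0.94, 0.29, 0.05, 0.01) are the standard textbook values for this example:

```python
# The classic burglary network over binary variables (1 = true, 0 = false).
# "cpt" maps a tuple of parent values to P(X=1 | parents).
network = {
    "Burglary":   {"parents": [], "cpt": {(): 0.001}},
    "Earthquake": {"parents": [], "cpt": {(): 0.002}},
    "Alarm":      {"parents": ["Burglary", "Earthquake"],
                   "cpt": {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}},
    "JohnCalls":  {"parents": ["Alarm"], "cpt": {(1,): 0.90, (0,): 0.05}},
    "MaryCalls":  {"parents": ["Alarm"], "cpt": {(1,): 0.70, (0,): 0.01}},
}
```

Note the storage count: these five binary nodes need 1 + 1 + 4 + 2 + 2 = 10 numbers, versus 2^5 − 1 = 31 for the full joint table.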
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] Calculation of Joint Probability • P(j∧m∧a∧¬b∧¬e) = ?
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls]
• P(j∧m∧a∧¬b∧¬e) = P(j∧m|a,¬b,¬e) P(a∧¬b∧¬e) = P(j|a,¬b,¬e) P(m|a,¬b,¬e) P(a∧¬b∧¬e) (J and M are independent given A)
• P(j|a,¬b,¬e) = P(j|a) (J and B, and J and E, are independent given A)
• P(m|a,¬b,¬e) = P(m|a)
• P(a∧¬b∧¬e) = P(a|¬b,¬e) P(¬b|¬e) P(¬e) = P(a|¬b,¬e) P(¬b) P(¬e) (B and E are independent)
• P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] Calculation of Joint Probability • P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] Calculation of Joint Probability • Full joint distribution: P(x1,x2,…,xn) = ∏i=1,…,n P(xi | paXi) • P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062
Chain Rule for Bayes Nets • The joint distribution is the product of all CPTs • P(X1,X2,…,Xn) = ∏i=1,…,n P(Xi | PaXi)
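As a minimal sketch of the chain rule in code (reusing the `network` dict from the earlier sketch), the function below multiplies one CPT entry per node and reproduces the 0.00062 computed above:

```python
def joint_probability(network, assignment, order):
    """Chain rule: P(x1,...,xn) = prod_i P(xi | pa_Xi), where `assignment`
    maps every variable name to 0 or 1."""
    p = 1.0
    for name in order:
        node = network[name]
        parent_values = tuple(assignment[pa] for pa in node["parents"])
        p_true = node["cpt"][parent_values]     # P(X=1 | parents)
        p *= p_true if assignment[name] == 1 else 1.0 - p_true
    return p

order = ["Burglary", "Earthquake", "Alarm", "JohnCalls", "MaryCalls"]
event = {"JohnCalls": 1, "MaryCalls": 1, "Alarm": 1, "Burglary": 0, "Earthquake": 0}
print(joint_probability(network, event, order))  # ~0.00062
```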
Example: Naïve Bayes models • P(Cause,Effect1,…,Effectn) = P(Cause) ∏i P(Effecti | Cause) • [Diagram: Cause → Effect1, Cause → Effect2, …, Cause → Effectn]
Advantages of Bayes Nets (and other graphical models) • More manageable # of parameters to set and store • Incremental modeling • Explicit encoding of independence assumptions • Efficient inference techniques
Arcs do not necessarily encode causality • [Diagrams: three networks over A, B, C with the arcs oriented differently] • Two BNs with the same expressive power, and a third with greater power (exercise)
Reading off independence relationships • [Chain: A → B → C] • Given B, does the value of A affect the probability of C? • P(C|B,A) = P(C|B)? • No! • C's parent (B) is given, so C is independent of its non-descendants (A) • Independence is symmetric: C ⊥ A | B ⇒ A ⊥ C | B
Basic Rule • A node is independent of its non-descendants given its parents (and given nothing else)
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] What does the BN encode? • Burglary ⊥ Earthquake • JohnCalls ⊥ MaryCalls | Alarm • JohnCalls ⊥ Burglary | Alarm • JohnCalls ⊥ Earthquake | Alarm • MaryCalls ⊥ Burglary | Alarm • MaryCalls ⊥ Earthquake | Alarm • A node is independent of its non-descendants, given its parents
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] Reading off independence relationships • How about Burglary ⊥ Earthquake | Alarm? • No! Why?
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] Reading off independence relationships • How about Burglary ⊥ Earthquake | Alarm? • No! Why? • P(b∧e|a) = P(a|b,e) P(b∧e) / P(a) ≈ 0.00075 • P(b|a) P(e|a) ≈ 0.086
[Network: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls] Reading off independence relationships • How about Burglary ⊥ Earthquake | JohnCalls? • No! Why? • Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent
Independence relationships • For polytrees, there exists a unique undirected path between A and B. For each node E on the path: • Evidence on a directed chain X → E → Y or X ← E ← Y makes X and Y independent • Evidence on a common cause X ← E → Y makes the descendants X and Y independent • Evidence on a "v" node or below it (X → E ← Y, or X → W ← Y with W → … → E) makes X and Y dependent (otherwise they are independent)
General case • Formal property in the general case: • D-separation: the above properties hold for all (acyclic) paths between A and B • D-separation ⇒ independence • That is, we can't read off any more independence relationships from the graph than those that are encoded in d-separation • The CPTs may indeed encode additional independences
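D-separation can also be tested mechanically. The sketch below (my own helper, not from the slides) uses the standard moralized-ancestral-graph criterion, which is equivalent to checking every path: keep only the ancestors of the variables involved, connect ("marry") co-parents and drop arc directions, delete the evidence nodes, and test whether any undirected path remains:

```python
def d_separated(edges, xs, ys, zs):
    """True iff xs and ys are d-separated given zs in the DAG given by
    `edges`, a list of (parent, child) pairs; xs, ys, zs are sets."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # 1. Restrict to the ancestral subgraph of xs, ys, and zs.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, ()))

    # 2. Moralize: undirected edges along arcs and between co-parents.
    adj = {n: set() for n in relevant}
    for v in relevant:
        ps = parents.get(v, set()) & relevant
        for u in ps:
            adj[u].add(v)
            adj[v].add(u)
            for w in ps:
                if u != w:
                    adj[u].add(w)

    # 3. Delete zs and check whether xs can still reach ys.
    seen, stack = set(), [n for n in xs if n not in zs]
    while stack:
        n = stack.pop()
        if n in ys:
            return False
        if n not in seen:
            seen.add(n)
            stack.extend(m for m in adj[n] if m not in zs)
    return True

edges = [("Burglary", "Alarm"), ("Earthquake", "Alarm"),
         ("Alarm", "JohnCalls"), ("Alarm", "MaryCalls")]
print(d_separated(edges, {"Burglary"}, {"Earthquake"}, set()))      # True
print(d_separated(edges, {"Burglary"}, {"Earthquake"}, {"Alarm"}))  # False
```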
Probability Queries • Given: some probabilistic model over variables X • Find: distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y • P(Y | E=e) • This is the inference problem
Answering Inference Problems with the Joint Distribution • Easiest case: Y = X \ E • P(Y|E=e) = P(Y,e) / P(e) • The denominator makes the probabilities sum to 1 • Determine P(e) by marginalizing: P(e) = Σy P(Y=y, e) • Otherwise, let Z = X \ (E ∪ Y) • P(Y|E=e) = Σz P(Y, Z=z, e) / P(e) • P(e) = Σy Σz P(Y=y, Z=z, e) • Inference with the joint distribution: O(2^|X\E|) for binary variables
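A brute-force sketch of this procedure (reusing `network`, `order`, and `joint_probability` from the earlier sketches): enumerate every assignment to the hidden variables, sum the joint entries, and normalize at the end. The O(2^|X\E|) cost shows up directly as the size of the inner product() loop:

```python
from itertools import product

def enumerate_query(network, order, query_var, evidence):
    """P(query_var | evidence) by summing the full joint distribution."""
    hidden = [v for v in order if v != query_var and v not in evidence]
    dist = {}
    for q in (0, 1):
        total = 0.0
        for values in product((0, 1), repeat=len(hidden)):  # 2^|hidden| terms
            assignment = dict(evidence, **{query_var: q}, **dict(zip(hidden, values)))
            total += joint_probability(network, assignment, order)
        dist[q] = total
    z = sum(dist.values())            # this is P(e), the normalizer
    return {q: p / z for q, p in dist.items()}

print(enumerate_query(network, order, "Burglary", {"JohnCalls": 1, "MaryCalls": 1}))
# ≈ {0: 0.716, 1: 0.284} with the CPT values assumed earlier
```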
Naïve Bayes Classifier • P(Class,Feature1,…,Featuren) = P(Class) ∏i P(Featurei | Class) • [Diagram: Class → Feature1, Feature2, …, Featuren; Class = Spam / Not Spam, English / French / Latin, …; Features = word occurrences] • Given features, what class? • P(C|F1,…,Fn) = P(C,F1,…,Fn) / P(F1,…,Fn) = 1/Z · P(C) ∏i P(Fi|C)
Naïve Bayes Classifier • P(Class,Feature1,…,Featuren) = P(Class) ∏i P(Featurei | Class) • Given some features F1,…,Fk (k ≤ n), what is the distribution over class?
P(C|F1,…,Fk) = 1/Z · P(C,F1,…,Fk)
= 1/Z · Σfk+1…fn P(C,F1,…,Fk,fk+1,…,fn)
= 1/Z · P(C) Σfk+1…fn ∏i=1…k P(Fi|C) ∏j=k+1…n P(fj|C)
= 1/Z · P(C) ∏i=1…k P(Fi|C) ∏j=k+1…n Σfj P(fj|C)
= 1/Z · P(C) ∏i=1…k P(Fi|C), since each Σfj P(fj|C) = 1
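A minimal sketch of the resulting classifier (the spam model and all of its numbers are hypothetical). Per the derivation above, features that are not observed never enter the product, so only the observed ones are multiplied in:

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """P(C | observed) for a naive Bayes model over binary features.
    prior: {class: P(class)}
    likelihoods: {class: {feature: P(feature=1 | class)}}
    observed: {feature: 0 or 1}; unobserved features marginalize out."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for f, value in observed.items():
            p = likelihoods[c][f]
            score *= p if value == 1 else 1.0 - p
        scores[c] = score
    z = sum(scores.values())          # the 1/Z normalization from the slide
    return {c: s / z for c, s in scores.items()}

prior = {"spam": 0.4, "not spam": 0.6}
likelihoods = {"spam":     {"offer": 0.60, "meeting": 0.05},
               "not spam": {"offer": 0.05, "meeting": 0.30}}
print(naive_bayes_posterior(prior, likelihoods, {"offer": 1}))
# {'spam': ~0.889, 'not spam': ~0.111}
```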
For General Queries • For BNs and queries in general, it’s not that simple… more in later lectures. • Next class: skim 5.1-3, begin reading 9.1-4