Artificial Intelligence, CS 165A. Tuesday, November 27, 2007. Today: Probabilistic Reasoning (Ch. 14)
Notes • HW #5 now posted • Due in one week • No programming problems • Work on your own • Schedule from here on...
Marginalization • Given P(X, Y), derive P(X) ("marginalize away Y"): P(X) = Σ_y P(X, y) • Equivalent: Given P(X | Y) and P(Y), derive P(X): P(X) = Σ_y P(X | y) P(y)
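As a quick sketch of the first form, summing a small joint table over Y (the numbers here are made up for illustration):

```python
# Marginalizing a joint distribution P(X, Y) to obtain P(X).
# The joint-table entries below are invented example values.
joint = {
    ("x1", "y1"): 0.20, ("x1", "y2"): 0.10,
    ("x2", "y1"): 0.25, ("x2", "y2"): 0.45,
}

def marginal_x(joint):
    """P(X=x) = sum over y of P(X=x, Y=y)."""
    p = {}
    for (x, _y), prob in joint.items():
        p[x] = p.get(x, 0.0) + prob
    return p

px = marginal_x(joint)   # approximately {"x1": 0.30, "x2": 0.70}
```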
Marginalization (cont.) • Marginalization is a common procedure • E.g., used to normalize Bayes' rule: in P(H | D) = P(D | H) P(H) / P(D), the terms P(D | H) and P(H) are known, and the normalizer P(D) is the unknown • By marginalization, P(D) = Σ_H P(D, H) = Σ_H P(D | H) P(H) = P(D | H1) P(H1) + P(D | H2) P(H2) + … + P(D | HN) P(HN)
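The equivalent form P(X) = Σ_y P(X | y) P(y) can be sketched the same way (the distributions below are invented, with a binary X and a three-valued Y):

```python
# P(X=true) = sum over y of P(X=true | y) P(y).
# All probabilities here are made-up illustration values.
p_y = {"y1": 0.5, "y2": 0.3, "y3": 0.2}
p_x_given_y = {"y1": 0.9, "y2": 0.4, "y3": 0.1}  # P(X=true | Y=y)

p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)
# 0.9*0.5 + 0.4*0.3 + 0.1*0.2 = 0.45 + 0.12 + 0.02 = 0.59
```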
Computing probabilities • From the joint distribution P(X1, X2, …, XN) we can compute P(some variables) by summing over all the other variables • For example: P(X2, …, XN) = Σ_i P(X1=x1i, X2, …, XN) and P(X1) = Σ_i Σ_j … Σ_p P(X1, X2=x2i, X3=x3j, …, XN=xNp) • For binary variables, from P(X, Y): P(Y) = Σ_i P(X=xi, Y) = P(X, Y) + P(¬X, Y) and P(X) = Σ_i P(X, Y=yi) = P(X, Y) + P(X, ¬Y)
Independence Absolute Independence • X and Y are independent iff • P(X, Y) = P(X) P(Y) [by definition] • P(X | Y) = P(X), since P(X | Y) = P(X, Y)/P(Y) = P(X) P(Y)/P(Y) = P(X) Conditional Independence • If X and Y are (conditionally) independent given Z, then • P(X | Y, Z) = P(X | Z) • Example: P(WetGrass | Season, Rain) = P(WetGrass | Rain)
Conditional Independence • In practice, conditional independence is more common than absolute independence • P(Final exam grade | Weather) ≠ P(Final exam grade) • I.e., they are not independent • But they are conditionally independent given Effort: P(Final exam grade | Weather, Effort) = P(Final exam grade | Effort), i.e., P(F | W, E) = P(F | E) for nodes W (Weather), E (Effort), F (Final exam grade) • This leads to simplified rules for updating with Bayes' Rule, and then onward to Bayesian networks
Joint Probability • Joint probability: P(X1, X2, …, XN) • Defines the probability for any possible state of the world • 2^N entries in the case of binary variables • Defined by 2^N − 1 independent numbers (the entries must sum to 1) • What if not binary? • If the variables are independent, then P(X1, X2, …, XN) = P(X1) P(X2) … P(XN) • Binary case: defined by N independent numbers • What if not binary?
The Chain Rule again (cont.) • Recursive definition: P(X1, X2, …, XN) = P(X1 | X2, …, XN) P(X2 | X3, …, XN) … P(XN−1 | XN) P(XN) or equivalently = P(X1) P(X2 | X1) P(X3 | X2, X1) … P(XN | XN−1, …, X1) • How many values are needed to represent this (assuming binary variables)? 2^N − 1 = 1 + 2 + 4 + … + 2^(N−1)
Note on number of independent values… • Random variables W, X, Y, and Z • W = {w1, w2, w3, w4} • X = {x1, x2} • Y = {y1, y2, y3} • Z = {z1, z2, z3, z4} • How many (independent) numbers are needed to describe: • P(W): 4 − 1 = 3 • P(X, Y): 2·3 − 1 = 5 • P(W, X, Y, Z): 4·2·3·4 − 1 = 95 • P(X | Y): (2 − 1)·3 = 3 • P(W | X, Y, Z): (4 − 1)·(2·3·4) = 72
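The counting rule behind these answers can be sketched as a small helper (the function name is just for illustration): a table P(A | B, C, …) has (number of values of A minus 1) free entries per assignment of the conditioning variables, since each column must sum to 1.

```python
from math import prod

# Count the independent numbers in a (conditional) probability table.
# target_sizes: domain sizes of the variables on the left of the bar;
# given_sizes: domain sizes of the conditioning variables (may be empty).
def num_free_params(target_sizes, given_sizes=()):
    # One free parameter fewer per conditioning assignment, because
    # each conditional distribution must sum to 1.  prod(()) == 1.
    return (prod(target_sizes) - 1) * prod(given_sizes)

# The slide's five answers:
answers = [
    num_free_params((4,)),            # P(W) -> 3
    num_free_params((2, 3)),          # P(X, Y) -> 5
    num_free_params((4, 2, 3, 4)),    # P(W, X, Y, Z) -> 95
    num_free_params((2,), (3,)),      # P(X | Y) -> 3
    num_free_params((4,), (2, 3, 4))  # P(W | X, Y, Z) -> 72
]
```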
Benefit of conditional independence • If some variables are conditionally independent, the joint probability can be specified with many fewer than 2^N − 1 numbers (or 3^N − 1, or 10^N − 1, or…) • For example: (for binary variables W, X, Y, Z) • P(W,X,Y,Z) = P(W) P(X|W) P(Y|W,X) P(Z|W,X,Y) • 1 + 2 + 4 + 8 = 15 numbers to specify • But if Y and W are independent given X, and Z is independent of W and X given Y, then • P(W,X,Y,Z) = P(W) P(X|W) P(Y|X) P(Z|Y) • 1 + 2 + 2 + 2 = 7 numbers • This is often the case in real problems, and belief networks take advantage of this.
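The two parameter counts above follow from one fact: for binary variables, a CPT whose node has k parents needs 2^k numbers (one free parameter per parent assignment). A minimal check:

```python
# Parameter counts for the two factorizations of P(W, X, Y, Z)
# over binary variables: a CPT with k parents needs 2**k numbers.
full_chain = 2**0 + 2**1 + 2**2 + 2**3       # P(W) P(X|W) P(Y|W,X) P(Z|W,X,Y)
with_cond_indep = 2**0 + 2**1 + 2**1 + 2**1  # P(W) P(X|W) P(Y|X) P(Z|Y)
# full_chain == 15, with_cond_indep == 7
```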
Belief Networks (a.k.a. Bayesian networks, probabilistic networks, Bayes nets, etc.) • Belief network • A data structure (depicted as a graph) that represents the dependence among variables and allows us to concisely specify the joint probability distribution • The graph itself is known as an "influence diagram" • A belief network is a directed acyclic graph where: • The nodes represent the set of random variables (one node per random variable) • Arcs between nodes represent influence, or causality • A link from node X to node Y means that X "directly influences" Y • Each node has a conditional probability table (CPT) that defines P(node | parents)
Example • Random variables X and Y • X – It is raining • Y – The grass is wet • X has a causal effect on Y Or, Y is a symptom of X • Draw two nodes and link them: X → Y • Define the CPT for each node: P(X) for X and P(Y | X) for Y • Typical use: we observe Y and we want to query P(X | Y) • Y is an evidence variable – TELL(KB, Y) • X is a query variable – ASK(KB, X)
Try it… ASK(KB, X) • What is P(X | Y) for the two-node net X → Y? • Given that we know the CPTs, P(X) and P(Y | X), of each node in the graph
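One way to answer the query is Bayes' rule with marginalization in the denominator: P(X | Y) = P(Y | X) P(X) / [P(Y | X) P(X) + P(Y | ¬X) P(¬X)]. A sketch with invented CPT numbers:

```python
# P(X | Y=true) for the rain -> wet-grass net, by Bayes' rule.
# CPT values below are made up for illustration.
p_x = 0.2                              # P(X=true): it rains
p_y_given = {True: 0.9, False: 0.1}    # P(Y=true | X)

num = p_y_given[True] * p_x                   # P(Y|X) P(X)
den = num + p_y_given[False] * (1 - p_x)      # P(Y), by marginalizing over X
p_x_given_y = num / den
# 0.18 / (0.18 + 0.08) ≈ 0.69: wet grass makes rain much more likely
```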
Belief nets represent the joint probability • The joint probability function can be calculated directly from the network • It's the product of the CPTs of all the nodes: P(var1, …, varN) = Πi P(vari | Parents(vari)) • For the net X → Y: P(X, Y) = P(X) P(Y | X) • For the net X → Z ← Y: P(X, Y, Z) = P(X) P(Y) P(Z | X, Y)
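The product-of-CPTs rule can be written once and reused for any network. A minimal sketch, with the network encoded as a dictionary mapping each node to its parent list and CPT (this representation is an assumption for illustration, not a standard API):

```python
# Joint probability as the product of CPT entries, one factor per node:
# P(x1, ..., xn) = prod_i P(xi | parents(xi)).
def joint_prob(network, assignment):
    """network: {node: (parents, cpt)}, where cpt maps
    (node_value, *parent_values) -> probability."""
    p = 1.0
    for node, (parents, cpt) in network.items():
        key = (assignment[node],) + tuple(assignment[q] for q in parents)
        p *= cpt[key]
    return p

# The X -> Y example, with invented CPT numbers:
net = {
    "X": ((), {(True,): 0.2, (False,): 0.8}),
    "Y": (("X",), {(True, True): 0.9, (False, True): 0.1,
                   (True, False): 0.1, (False, False): 0.9}),
}
p = joint_prob(net, {"X": True, "Y": True})   # P(X) P(Y|X) = 0.2 * 0.9
```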
Example I'm at work and my neighbor John calls to say my home alarm is ringing, but my neighbor Mary doesn't call. The alarm is sometimes triggered by minor earthquakes. Was there a burglar at my house? • Random (Boolean) variables: JohnCalls, MaryCalls, Earthquake, Burglary, Alarm • The belief net shows the causal links • This defines the joint probability • P(JohnCalls, MaryCalls, Earthquake, Burglary, Alarm) • What do we want to know? P(B | J, ¬M) Why not P(B | J, A, ¬M)? (We never observe the alarm directly – we only hear about it from John.)
Example Links and CPTs?
Example Joint probability? E.g., P(J, ¬M, A, B, ¬E)?
Calculate P(J, ¬M, A, B, ¬E) Read the joint pf from the graph: P(J, ¬M, A, B, ¬E) = P(B) P(¬E) P(A | B, ¬E) P(J | A) P(¬M | A) Plug in the desired values: P(J, ¬M, A, B, ¬E) = 0.001 × 0.998 × 0.94 × 0.9 × 0.3 = 0.0002532924 How about P(B | J, ¬M)? Remember, this means P(B=true | J=true, M=false)
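The slide's arithmetic, spelled out (the five factors are the values read off the CPTs of the burglary network):

```python
# P(J, not-M, A, B, not-E) = P(B) P(not-E) P(A | B, not-E) P(J | A) P(not-M | A)
p_b = 0.001              # P(B=true)
p_not_e = 0.998          # P(E=false) = 1 - 0.002
p_a_given_b_note = 0.94  # P(A=true | B=true, E=false)
p_j_given_a = 0.9        # P(J=true | A=true)
p_notm_given_a = 0.3     # P(M=false | A=true) = 1 - 0.7

p = p_b * p_not_e * p_a_given_b_note * p_j_given_a * p_notm_given_a
# p == 0.0002532924 (up to floating-point rounding)
```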
Calculate P(B | J, ¬M) By marginalization: P(B | J, ¬M) = P(B, J, ¬M) / P(J, ¬M) = Σ_E Σ_A P(B, E, A, J, ¬M) / Σ_B Σ_E Σ_A P(B, E, A, J, ¬M), where each term in the sums is read off the network as a product of CPT entries
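This marginalization can be sketched as inference by enumeration. The sketch below assumes the standard textbook CPTs for the burglary network (the values 0.001, 0.998, 0.94, 0.9, 0.3 used on the previous slide are consistent with them):

```python
from itertools import product

# Inference by enumeration on the burglary network: P(B=true | J=true, M=false).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}   # P(J=true | A)
P_M = {True: 0.70, False: 0.01}   # P(M=true | A)

def joint(b, e, a, j, m):
    """Product of CPT entries for one full assignment."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Numerator: sum over the hidden variables E and A, with B=true.
num = sum(joint(True, e, a, True, False)
          for e, a in product([True, False], repeat=2))
# Denominator: also sum over B.
den = sum(joint(b, e, a, True, False)
          for b, e, a in product([True, False], repeat=3))
p_b_given = num / den   # roughly 0.005: one call, with Mary silent, barely raises the belief
```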
Example • Conditional independence is seen here • P(JohnCalls | MaryCalls, Alarm, Earthquake, Burglary) = P(JohnCalls | Alarm) • So JohnCalls is independent of MaryCalls, Earthquake, and Burglary, given Alarm • Does this mean that an earthquake or a burglary does not influence whether or not John calls? • No, but the influence is already accounted for in the Alarm variable • JohnCalls is conditionally independent of Earthquake, but not absolutely independent of it
Naive Bayes model • A common situation is when a single cause directly influences several variables, which are all conditionally independent, given the cause • Example: cause C = Rain, with effects e1 = Wet grass, e2 = People with umbrellas, e3 = Car accidents • P(C, e1, e2, e3) = P(C) P(e1 | C) P(e2 | C) P(e3 | C) • In general, P(C, e1, …, en) = P(C) Πi P(ei | C)
Naive Bayes model • Typical query for naive Bayes: given some evidence, what's the probability of the cause? • P(C | e1) = ? • P(C | e1, e3) = ? • E.g., what is the probability of Rain, given wet grass (e1) and car accidents (e3)?
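Such queries reduce to normalizing P(C) times the product of P(ei | C) over the observed effects. A sketch with invented probabilities for the Rain example:

```python
# Naive Bayes posterior: P(C | evidence) obtained by normalizing
# P(C) * prod_i P(e_i | C) over the two values of C.
# All probabilities here are made-up illustration values.
p_rain = 0.3
p_e_given_rain   = {"wet_grass": 0.9, "umbrellas": 0.8, "accidents": 0.3}
p_e_given_norain = {"wet_grass": 0.2, "umbrellas": 0.1, "accidents": 0.1}

def posterior(observed):
    """P(Rain=true | the listed effects are all observed true)."""
    a = p_rain
    b = 1 - p_rain
    for e in observed:
        a *= p_e_given_rain[e]
        b *= p_e_given_norain[e]
    return a / (a + b)

p1 = posterior(["wet_grass"])               # P(C | e1)
p2 = posterior(["wet_grass", "accidents"])  # P(C | e1, e3): more evidence, higher belief
```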
Drawing belief nets • What would a belief net look like if all the variables were fully dependent? A fully connected net over X1, …, X5, with a link from each Xi to every later Xj: P(X1,X2,X3,X4,X5) = P(X1) P(X2|X1) P(X3|X1,X2) P(X4|X1,X2,X3) P(X5|X1,X2,X3,X4) • But this isn't the only way to draw the belief net when all the variables are fully dependent
Fully connected belief net • In fact, there are N! ways of connecting up a fully-connected belief net • That is, there are N! ways of ordering the nodes • For N = 2: P(X1, X2) = P(X1) P(X2 | X1) or P(X2) P(X1 | X2) • For N = 5: P(X1, X2, X3, X4, X5) can be factored per any of the 5! = 120 orderings – the one drawn, and 119 others…
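Each node ordering yields one chain-rule factorization, so the count of fully-connected drawings is just the count of permutations:

```python
# One fully-connected belief net per ordering of the nodes: N! in total.
from itertools import permutations

orderings = list(permutations(["X1", "X2", "X3", "X4", "X5"]))
count = len(orderings)   # 5! = 120: one drawing plus "119 others"
```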
Drawing belief nets (cont.) A fully-connected net displays the joint distribution: P(X1, X2, X3, X4, X5) = P(X1) P(X2|X1) P(X3|X1,X2) P(X4|X1,X2,X3) P(X5|X1,X2,X3,X4) But what if there are conditionally independent variables? Some links disappear: P(X1, X2, X3, X4, X5) = P(X1) P(X2|X1) P(X3|X1,X2) P(X4|X2,X3) P(X5|X3,X4)
Drawing belief nets (cont.) What if the variables are all independent? No links at all: P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3) P(X4) P(X5) What if the links form a directed cycle? Not allowed – a belief net must be a DAG
Drawing belief nets (cont.) What if the links are drawn like this: X1 → X3, X3 → X2, X2 → X4, X4 → X5? P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X3) P(X3 | X1) P(X4 | X2) P(X5 | X4) It can be redrawn with the nodes reordered as X1, X3, X2, X4, X5, so that all arrows go left-to-right