Introduction • Independence assumptions • Seem to be necessary for probabilistic inference to be practical. • Naïve Bayes Method • Makes independence assumptions that are often not true • Also called the Idiot Bayes Method for this reason. • Bayesian Network • Explicitly models the independence relationships in the data. • Uses these independence relationships to make probabilistic inferences. • Also known as: Belief Net, Bayes Net, Causal Net, …
Why Bayesian Networks? [Figure: a three-node network with Battery and Gas as parents of Start.] • Intuitive language • Can utilize causal knowledge in constructing models • Domain experts comfortable building a network • General-purpose "inference" algorithms • P(Bad Battery | Has Gas, Won't Start) • Exact: modular specification leads to large computational efficiencies
Random Variables • A random variable takes values from a set of exhaustive and mutually exclusive possibilities. • Example: • throwing a die: small {1,2}, medium {3,4}, large {5,6} • Medical data: patient's age, blood pressure • Variable vs. event: a variable taking a value = an event.
Independence of Variables • An instantiation of a variable is an event. • A set of variables is independent iff all possible instantiations of the variables are independent. • Example: X: patient blood pressure {high, medium, low}; Y: patient sneezes {yes, no} P(X=high, Y=yes) = P(X=high) x P(Y=yes) P(X=high, Y=no) = P(X=high) x P(Y=no) ... P(X=low, Y=yes) = P(X=low) x P(Y=yes) P(X=low, Y=no) = P(X=low) x P(Y=no) • Conditional independence between a set of variables holds iff conditional independence holds between all possible instantiations of the variables.
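This definition can be checked mechanically. A minimal sketch in Python, with hypothetical numbers for the blood-pressure/sneezing example: the joint is built as a product of marginals, and independence is then verified over every instantiation.

```python
from itertools import product

# Hypothetical marginals for the example above.
p_x = {"high": 0.2, "medium": 0.5, "low": 0.3}   # P(X): blood pressure
p_y = {"yes": 0.1, "no": 0.9}                    # P(Y): sneezes

# A joint built as a product of marginals, so X and Y are independent.
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

# X and Y are independent iff the product rule holds for EVERY instantiation.
independent = all(abs(joint[x, y] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in product(p_x, p_y))
print(independent)  # True
```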
Bayesian Networks: Definition • Bayesian networks are directed acyclic graphs (DAGs). • Nodes in Bayesian networks represent random variables, which are normally assumed to take on discrete values. • The links of the network represent direct probabilistic influence. • The structure of the network represents the probabilistic dependence/independence relationships between the random variables represented by the nodes.
Bayesian Network: Probabilities • The nodes and links are quantified with probability distributions. • The root nodes (those with no parents) are assigned prior probability distributions. • Each other node is assigned the conditional probability distribution of the node given its parents.
Example Conditional Probability Tables (CPTs)
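The tables themselves appeared as figures. As a minimal sketch of what a CPT might look like in code, here is one possible representation for the Battery/Gas/Start example above, with hypothetical numbers:

```python
# Root nodes get priors; non-root nodes get one CPT entry per
# instantiation of their parents. All numbers are made up.
prior_battery = 0.98                    # P(Battery = good)
prior_gas = 0.95                        # P(Gas = present)
cpt_start = {
    # (battery_good, gas_present): P(Start = true | parents)
    (True, True): 0.95,
    (True, False): 0.0,
    (False, True): 0.05,
    (False, False): 0.0,
}
```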
Noisy-OR-Gate • Exception independence: in a noisy-OR gate, the exceptions (inhibitors) to the individual causations are independent. • For two causes: P(E | C1, C2) = 1 - (1 - P(E | C1))(1 - P(E | C2))
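A minimal sketch of the noisy-OR combination for any number of active causes; the two-cause formula above is the special case:

```python
def noisy_or(cause_probs):
    """P(E | all given causes active), assuming each cause's exception
    (inhibitor) fires independently: E is absent only if every cause
    fails to produce it."""
    p_all_fail = 1.0
    for p in cause_probs:      # p = P(E | Ci alone)
        p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail

print(noisy_or([0.8, 0.6]))    # 1 - 0.2 * 0.4 = 0.92
```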
Inference in Bayesian Networks • Given a Bayesian network and its CPTs, we can compute probabilities of the following form: P(H | E1, E2, ..., En) where H, E1, E2, ..., En are assignments to nodes (random variables) in the network. • Example: The probability of family-out given lights out and hearing bark: P(fo | ¬lo, hb).
Semantics of Belief Networks • Two ways of understanding the semantics of a belief network: • As a representation of the joint probability distribution • As an encoding of a collection of conditional independence statements
Terminology [Figure: a node X with parents Y1 and Y2, illustrating ancestors, parents, descendants, and non-descendants of X.]
Connection Pattern and Independence • Linear connection: the two end variables are usually dependent on each other; observing the middle variable renders them independent. • Converging connection: the two end variables are usually independent of each other; observing the middle variable (or one of its descendants) renders them dependent. • Diverging connection: the two end variables are usually dependent on each other; observing the middle variable renders them independent.
D-Separation • A variable a is d-separated from b by a set of variables E iff there is no d-connecting path between a and b, i.e., no path such that: • none of its linear or diverging nodes is in E, and • for each of its converging nodes, either it or one of its descendants is in E. • Intuition: • Influence between a and b must propagate through a d-connecting path.
If a and b are d-separated by E, then they are conditionally independent of each other given E: P(a, b | E) = P(a | E) x P(b | E)
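This can be verified numerically. A minimal sketch on the chain A → B → C (B d-separates A and C, since B is a linear node on the only path), with hypothetical CPT numbers:

```python
from itertools import product

# Hypothetical CPTs for the chain A -> B -> C.
pA = {True: 0.3, False: 0.7}
pB_given_A = {True: 0.9, False: 0.2}   # P(B=true | A)
pC_given_B = {True: 0.8, False: 0.1}   # P(C=true | B)

def joint(a, b, c):
    pb = pB_given_A[a] if b else 1 - pB_given_A[a]
    pc = pC_given_B[b] if c else 1 - pC_given_B[b]
    return pA[a] * pb * pc

def marg(pred):
    """Sum the joint over all assignments satisfying the predicate."""
    return sum(joint(a, b, c)
               for a, b, c in product([True, False], repeat=3)
               if pred(a, b, c))

pB = marg(lambda a, b, c: b)                      # P(B = true)
p_ac = marg(lambda a, b, c: a and c and b) / pB   # P(A=t, C=t | B=t)
p_a = marg(lambda a, b, c: a and b) / pB          # P(A=t | B=t)
p_c = marg(lambda a, b, c: c and b) / pB          # P(C=t | B=t)
print(abs(p_ac - p_a * p_c) < 1e-12)              # True: A and C indep. given B
```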
Chain Rule • A joint probability distribution can be expressed as a product of conditional probabilities: P(X1, X2, ..., Xn) = P(X1) x P(X2, X3, ..., Xn | X1) = P(X1) x P(X2|X1) x P(X3, X4, ..., Xn | X1, X2) = P(X1) x P(X2|X1) x P(X3|X1, X2) x P(X4, ..., Xn | X1, X2, X3) = ... = P(X1) x P(X2|X1) x P(X3|X1, X2) x ... x P(Xn | X1, ..., Xn-1) This has nothing to do with any independence assumption!
Compute the Joint Probability • Given a Bayesian network, let X1, X2, ..., Xn be an ordering of the nodes such that only nodes indexed lower than i may have a directed path to Xi (a topological ordering). • Since the parents of Xi d-separate Xi from all the other nodes indexed lower than i, P(Xi | X1, ..., Xi-1) = P(Xi | parents(Xi)) This probability is available in the Bayesian network. • Therefore, P(X1, X2, ..., Xn) can be computed from the probabilities available in the Bayesian network, as sketched below.
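A minimal sketch of this computation, using a hypothetical three-node network (roots B and E, both parents of A) with made-up CPT entries:

```python
# Topological order guarantees parents are assigned before children.
parents = {"B": [], "E": [], "A": ["B", "E"]}
cpt = {
    "B": {(): 0.01},     # P(B = true)
    "E": {(): 0.02},     # P(E = true)
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},   # P(A=true | B, E)
}

def joint(assignment):
    """P(X1, ..., Xn) = product of P(Xi | parents(Xi)), read off the CPTs."""
    p = 1.0
    for node in ["B", "E", "A"]:   # a topological ordering
        key = tuple(assignment[q] for q in parents[node])
        p_true = cpt[node][key]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

print(joint({"B": True, "E": False, "A": True}))   # 0.01 * 0.98 * 0.94
```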
What can Bayesian Networks Compute? • The input to a Bayesian network evaluation algorithm is a set of evidence: e.g., E = { hear-bark=true, lights-on=true } • The outputs of a Bayesian network evaluation algorithm are the probabilities P(Xi=v | E), where Xi is a variable in the network. • For example: P(family-out=true | E) is the probability of the family being out given hearing the dog's bark and seeing the lights on.
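On small networks these outputs can be computed by brute-force enumeration: sum the joint over every assignment consistent with the evidence, then normalize. A minimal sketch, reusing the hypothetical B/E/A network from the previous slide:

```python
from itertools import product

parents = {"B": [], "E": [], "A": ["B", "E"]}
cpt = {"B": {(): 0.01}, "E": {(): 0.02},
       "A": {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}}

def joint(assignment):
    p = 1.0
    for node in ["B", "E", "A"]:
        key = tuple(assignment[q] for q in parents[node])
        p_true = cpt[node][key]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

def query(var, evidence):
    """Return P(var = true | evidence) by enumerating hidden variables."""
    hidden = [n for n in parents if n != var and n not in evidence]
    score = {}
    for val in (True, False):
        score[val] = sum(
            joint({**evidence, var: val, **dict(zip(hidden, combo))})
            for combo in product([True, False], repeat=len(hidden)))
    return score[True] / (score[True] + score[False])

print(query("B", {"A": True}))   # P(B = true | A = true)
```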
Computation in Bayesian Networks • Computation in Bayesian networks is NP-hard: all known exact algorithms for computing the probabilities take time exponential in the size of the network in the worst case. • There are two ways around the complexity barrier: • Algorithms for special subclasses of networks, e.g., singly connected networks. • Approximate algorithms. • The computation for singly connected graphs is linear in the size of the network.
Source: www.cs.cmu.edu/~awm/381/lec/bayesinfer/bayesinf.ppt Bayesian Network Inference • Inference: calculating P(X | Y) for some variables or sets of variables X and Y. • Inference in Bayesian networks is #P-hard! [Figure: inputs I1, ..., I5, each with prior probability .5, feeding a boolean circuit that reduces to a single output O.] P(O) must be (#satisfying assignments) x (.5^#inputs), so computing P(O) is as hard as counting satisfying assignments.
Bayesian Network Inference • But… inference is still tractable in some cases. • Let's look at a special class of networks: trees / forests, in which each node has at most one parent.
Decomposing the probabilities • Suppose we want P(Xi | E), where E is some set of evidence variables. • Let's split E into two parts: • Ei- is the part consisting of assignments to variables in the subtree rooted at Xi • Ei+ is the rest of it
Decomposing the probabilities • P(Xi | E) = α π(Xi) λ(Xi) • Where: • α is a constant independent of Xi • π(Xi) = P(Xi | Ei+) • λ(Xi) = P(Ei- | Xi)
Using the decomposition for inference • We can use this decomposition to do inference as follows. First, compute λ(Xi) = P(Ei- | Xi) for all Xi recursively, using the leaves of the tree as the base case. • If Xi is a leaf: • If Xi is in E: λ(Xi) = 1 if Xi matches the evidence, 0 otherwise • If Xi is not in E: Ei- is the empty set, so P(Ei- | Xi) = 1 (constant)
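A minimal sketch of this base case, assuming binary variables and evidence given as a dict from variable name to observed value:

```python
def lambda_leaf(node, evidence):
    """λ(node) = P(Ei- | node) for a leaf, as a dict over the node's values."""
    if node in evidence:
        # 1 for the observed value, 0 for every other value.
        return {v: 1.0 if v == evidence[node] else 0.0 for v in (True, False)}
    # Unobserved leaf: Ei- is empty, so λ is identically 1.
    return {True: 1.0, False: 1.0}
```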
Quick aside: "Virtual evidence" • For theoretical simplicity, but without loss of generality, let's assume that all variables in E (the evidence set) are leaves in the tree. • Why we can do this WLOG: [Figure: observing an internal node Xi is equivalent to attaching a new leaf child Xi' and observing Xi', where P(Xi' | Xi) = 1 if Xi' = Xi, 0 otherwise.]
Calculating λ(Xi) for non-leaves • Suppose Xi has one child, Xj • Then: λ(Xi) = P(Ei- | Xi) = Σ_Xj P(Xj | Xi) λ(Xj)
Calculating λ(Xi) for non-leaves • Now, suppose Xi has a set of children, C. • Since Xi d-separates each of its subtrees, the contribution of each subtree to λ(Xi) is independent: λ(Xi) = Π_{j ∈ C} λj(Xi) where λj(Xi) is the contribution to P(Ei- | Xi) of the part of the evidence lying in the subtree rooted at one of Xi's children Xj, computed as in the one-child case above.
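A minimal sketch of the full λ recursion under the same assumptions (binary variables, tree structure as a children dict, and each child's CPT stored as P(child = true | parent)):

```python
def lam(node, children, cpt, evidence):
    """λ(Xi) = Π_j λj(Xi), with λj(Xi) = Σ_Xj P(Xj | Xi) λ(Xj)."""
    if not children.get(node):                       # leaf base case
        if node in evidence:
            return {v: float(v == evidence[node]) for v in (True, False)}
        return {True: 1.0, False: 1.0}
    result = {True: 1.0, False: 1.0}
    for child in children[node]:
        lam_child = lam(child, children, cpt, evidence)
        for xi in (True, False):
            p_true = cpt[child][xi]                  # P(child=true | node=xi)
            # λ_child's contribution: sum over the child's two values.
            result[xi] *= (p_true * lam_child[True]
                           + (1.0 - p_true) * lam_child[False])
    return result
```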
We are now λ-happy • So now we have a way to recursively compute all the λ(Xi)'s, starting from the root and using the leaves as the base case. • If we want, we can think of each node in the network as an autonomous processor that passes a little "λ message" to its parent.
The other half of the problem • Remember, P(Xi | E) = α π(Xi) λ(Xi). Now that we have all the λ(Xi)'s, what about the π(Xi)'s? • π(Xi) = P(Xi | Ei+). • What about the root of the tree, Xr? In that case, Er+ is the empty set, so π(Xr) = P(Xr). No sweat. Since we also know λ(Xr), we can compute the final P(Xr | E). • So for an arbitrary Xi with parent Xp, let's inductively assume we know π(Xp) and/or P(Xp | E). How do we get π(Xi)?
Computing π(Xi) • π(Xi) = Σ_Xp P(Xi | Xp) πi(Xp) • Where πi(Xp) is defined as P(Xp | E) / λi(Xp)
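A minimal sketch of this step, assuming binary variables, that the parent's posterior P(Xp | E) and this child's own λ-contribution λi(Xp) are already available from earlier passes, and that λi(Xp) is nonzero:

```python
def pi_of_node(cpt_i_given_p, posterior_p, lambda_i_of_p):
    """π(Xi) = Σ_Xp P(Xi | Xp) πi(Xp), with πi(Xp) = P(Xp | E) / λi(Xp).

    cpt_i_given_p[xp] = P(Xi = true | Xp = xp)."""
    # Divide the child's own contribution back out of the parent's posterior.
    pi_i = {xp: posterior_p[xp] / lambda_i_of_p[xp] for xp in (True, False)}
    pi = {}
    for xi in (True, False):
        pi[xi] = sum(
            (cpt_i_given_p[xp] if xi else 1.0 - cpt_i_given_p[xp]) * pi_i[xp]
            for xp in (True, False))
    return pi
```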
We're done. Yay! • Thus we can compute all the π(Xi)'s, and, in turn, all the P(Xi | E)'s. • Can think of nodes as autonomous processors passing λ and π messages to their neighbors.
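Putting the pieces together on the smallest interesting case: a chain R → M → L with evidence on the leaf. Since Er+ is empty at the root, P(R | E) = α P(R) λ(R), so only λ messages are needed here. All numbers are hypothetical.

```python
# Hypothetical chain R -> M -> L, binary variables.
pR = {True: 0.4, False: 0.6}
pM_given_R = {True: 0.7, False: 0.2}   # P(M=true | R)
pL_given_M = {True: 0.9, False: 0.3}   # P(L=true | M)
evidence = {"L": True}

# Leaf base case: λ(L) is 1 on the observed value, 0 elsewhere.
lam_L = {v: float(v == evidence["L"]) for v in (True, False)}
# λ(M) = Σ_L P(L | M) λ(L)
lam_M = {m: pL_given_M[m] * lam_L[True] + (1 - pL_given_M[m]) * lam_L[False]
         for m in (True, False)}
# λ(R) = Σ_M P(M | R) λ(M)
lam_R = {r: pM_given_R[r] * lam_M[True] + (1 - pM_given_R[r]) * lam_M[False]
         for r in (True, False)}

# P(R | E) = α P(R) λ(R): normalize the product.
unnorm = {r: pR[r] * lam_R[r] for r in (True, False)}
z = unnorm[True] + unnorm[False]
print({r: p / z for r, p in unnorm.items()})   # P(R | L=true)
```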
Conjunctive queries • What if we want, e.g., P(A, B | C) instead of just marginal distributions P(A | C) and P(B | C)? • Just use chain rule: • P(A, B | C) = P(A | C) P(B | A, C) • Each of the latter probabilities can be computed using the technique just discussed.
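A minimal sketch, assuming `posterior(var, val, evidence)` is any single-variable inference routine such as the message passing above:

```python
def conjunctive_query(posterior, a_val, b_val, evidence):
    """P(A=a, B=b | evidence) = P(A=a | evidence) * P(B=b | A=a, evidence)."""
    p_a = posterior("A", a_val, evidence)
    p_b = posterior("B", b_val, {**evidence, "A": a_val})
    return p_a * p_b
```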
Polytrees • Technique can be generalized to polytrees: undirected versions of the graphs are still trees, but nodes can have more than one parent
Dealing with cycles • Can deal with undirected cycles in the graph by: • clustering variables together (e.g., merging B and C into a single compound node BC) • conditioning (e.g., instantiating A once to each of its values, solving the resulting singly connected network each time, and combining the results) [Figures: a diamond A → B, A → C, B → D, C → D clustered into the chain A → BC → D; and the same network split into two copies, one with A set to 1 and one with A set to 0.]
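A minimal sketch of the conditioning idea, assuming a hypothetical `tree_posterior` routine that works once the loop-cutset variable A is instantiated, and a `weight` routine giving P(A = a | evidence):

```python
def condition_on_cutset(tree_posterior, weight, var, val, evidence):
    """P(var=val | E) = Σ_a P(var=val | E, A=a) * P(A=a | E)."""
    return sum(
        tree_posterior(var, val, {**evidence, "A": a}) * weight(a, evidence)
        for a in (True, False))
```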
Join trees • An arbitrary Bayesian network can be transformed via some evil graph-theoretic magic into a join tree in which a similar method can be employed. [Figure: a network over A, B, C, D, E, F, G transformed into a join tree with cluster nodes such as ABC, BCD, and DF.] • In the worst case the join tree nodes must take on exponentially many combinations of values, but this often works well in practice