Bayes Nets and Probabilities Oliver Schulte Machine Learning 726
Bayes Nets: General Points • Represent domain knowledge. • Allow for uncertainty. • Complete representation of probabilistic knowledge. • Represent causal relations. • Fast answers to many types of queries: • Probabilistic: What is the probability that a patient has strep throat given that they have a fever? • Relevance: Is fever relevant to having strep throat?
Bayes Net Links • Judea Pearl's Turing Award • See UBC’s AISpace
Random Variables • A random variable has a probability associated with each of its values. • A basic statement assigns a value to a random variable.
Probability for Sentences • A sentence or query is formed by using “and”, “or”, “not” recursively with basic statements. • Sentences also have probabilities assigned to them.
Probability Notation • Often probability theorists write A, B instead of A ∧ B (like Prolog). • If the intended random variables are known, they are often not mentioned.
Axioms of probability Sentences considered as sets of complete assignments For any sentences A, B • 0 ≤ P(A) ≤ 1 • P(true) = 1 and P(false) = 0 • P(A ∨ B) = P(A) + P(B) − P(A ∧ B) • P(A) = P(B) if A and B are logically equivalent.
The Logical Equivalence Pattern Rule 1: Logically equivalent expressions have the same probability.
Prove the Pattern: Marginalization • Theorem. P(A) = P(A, B) + P(A, not B) • Proof. • 1. A is logically equivalent to [(A and B) or (A and not B)]. • 2. P(A) = P([(A and B) or (A and not B)]) = P(A and B) + P(A and not B) − P([(A and B) and (A and not B)]). Disjunction Rule. • 3. [(A and B) and (A and not B)] is logically equivalent to false, so P([(A and B) and (A and not B)]) = 0. • 4. So 2. and 3. imply P(A) = P(A and B) + P(A and not B).
Completeness of Bayes Nets • A probabilistic query system is complete if it can compute a probability for every sentence. • Proposition: A Bayes net is complete. Proof has two steps. • Any system that encodes the joint distribution is complete. • A Bayes net encodes the joint distribution.
Assigning Probabilities to Sentences • A complete assignment is a conjunctive sentence that assigns a value to each random variable. • The joint probability distribution specifies a probability for each complete assignment. • A joint distribution determines a probability for every sentence. • How? Spot the pattern.
Inference by enumeration • Marginalization: For any sentence A, sum the joint probabilities for the complete assignments where A is true. • P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.
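The enumeration step is easy to see in code. Below is a minimal Python sketch: the four Toothache = true entries are the numbers quoted above, while the four Toothache = false entries are the usual textbook values and are an assumption here, as are the `joint` and `prob` names.

```python
# Full joint distribution over (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """Marginalization: sum the joint probabilities of every complete
    assignment (toothache, catch, cavity) for which `event` is true."""
    return sum(p for assignment, p in joint.items() if event(*assignment))

# P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
print(round(prob(lambda toothache, catch, cavity: toothache), 3))
```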
Completeness Proof for Joint Distribution • Theorem [from propositional logic] Every sentence is logically equivalent to a disjunction of the form A1 or A2 or ... or Ak where the Ai are complete assignments. • All of the Ai are mutually exclusive (joint probability 0). Why? • So if S is equivalent to A1 or A2 or ... or Ak, then P(S) = Σi P(Ai), where each P(Ai) is given by the joint distribution.
The Story • You have a new burglar alarm installed at home. • It is reliable at detecting burglary, but also responds to earthquakes. • You have two neighbors who promise to call you at work when they hear the alarm. • John always calls when he hears the alarm, but sometimes confuses the alarm with the telephone ringing. • Mary listens to loud music and sometimes misses the alarm.
Computing The Joint Distribution • A Bayes net provides a compact factored representation of a joint distribution. • In words, the joint probability is computed as follows. • For each node Xi: • Find the assigned value xi. • Find the values y1,..,yk assigned to the parents of Xi. • Look up the conditional probability P(xi|y1,..,yk) in the Bayes net. • Multiply together these conditional probabilities.
Product Formula Example: Burglary • Query: What is the joint probability that all variables are true? • P(M, J, A, E, B) = P(M|A) P(J|A) P(A|E,B) P(E) P(B) = 0.7 × 0.9 × 0.95 × 0.002 × 0.001 ≈ 1.2 × 10^-6
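A minimal sketch of this computation in Python, using exactly the five conditional probabilities quoted above (the dictionary keys are just illustrative names):

```python
# CPT entries from the slide, all for the "everything is true" assignment.
cpt = {
    "M_given_A":   0.7,    # P(M = true | A = true)
    "J_given_A":   0.9,    # P(J = true | A = true)
    "A_given_E_B": 0.95,   # P(A = true | E = true, B = true)
    "E":           0.002,  # P(E = true)
    "B":           0.001,  # P(B = true)
}

# Product formula: multiply, for each node, P(value | values of its parents).
joint_all_true = (cpt["M_given_A"] * cpt["J_given_A"] *
                  cpt["A_given_E_B"] * cpt["E"] * cpt["B"])

print(joint_all_true)  # ≈ 1.197e-06
```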
Compactness of Bayesian Networks • Consider n binary variables • Unconstrained joint distribution requires O(2^n) probabilities • If we have a Bayesian network, with a maximum of k parents for any node, then we need O(n · 2^k) probabilities • Example • Full unconstrained joint distribution • n = 30: need 2^30 probabilities for the full joint distribution • Bayesian network • n = 30, k = 4: need 30 × 2^4 = 480 probabilities
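A quick sanity check of these counts (a sketch; the exact constants depend on how CPT rows are counted):

```python
n, k = 30, 4

full_joint = 2 ** n        # unconstrained joint over 30 binary variables
bayes_net = n * 2 ** k     # at most 2^k CPT rows for each of the n nodes

print(full_joint)  # 1073741824, roughly 10^9 probabilities
print(bayes_net)   # 480
```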
Summary: Why are Bayes nets useful? • Graph structure supports: • Modular representation of knowledge • Local, distributed algorithms for inference and learning • Intuitive (possibly causal) interpretation • Factored representation may have exponentially fewer parameters than the full joint P(X1,…,Xn) => • lower sample complexity (less data for learning) • lower time complexity (less time for inference)
Is it Magic? • How can the Bayes net reduce parameters? By exploiting conditional independencies. • Why does the product formula work? • The Bayes net's topological (graphical) semantics. • The graph by itself entails conditional independencies. • The Chain Rule.
Conditional Probabilities: Intro • Given (A) that a die comes up with an odd number, what is the probability that (B) the number is • a 2 • a 3 • Answer: the number of cases that satisfy both A and B, out of the number of cases that satisfy A. • Examples: • #faces with (odd and 2) / #faces with odd = 0/3 = 0. • #faces with (odd and 3) / #faces with odd = 1/3.
Conditional Probs ctd. • Suppose that 50 students are taking 310 and 30 are women. Given (A) that a student is taking 310, what is the probability that (B) they are a woman? • Answer: #students who take 310 and are women / #students in 310 = 30/50 = 3/5. • Notation: P(B|A)
Conditional Ratios: Spot the Pattern
Conditional Probs: The Ratio Pattern • Spot the Pattern: P(A|B) = P(A and B) / P(B) • Important!
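A minimal numeric check of the ratio pattern, using only the four Toothache = true entries quoted in the enumeration slide; the variable names are illustrative:

```python
# Joint probabilities of the complete assignments with toothache = true.
toothache_rows = {
    ("catch", "cavity"):       0.108,
    ("no catch", "cavity"):    0.012,
    ("catch", "no cavity"):    0.016,
    ("no catch", "no cavity"): 0.064,
}

p_toothache = sum(toothache_rows.values())                        # 0.2
p_cavity_and_toothache = sum(p for (catch, cavity), p in toothache_rows.items()
                             if cavity == "cavity")               # 0.12

# Ratio pattern: P(cavity | toothache) = P(cavity and toothache) / P(toothache)
print(round(p_cavity_and_toothache / p_toothache, 3))             # 0.6
```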
Conditional Probabilities: Motivation • Much knowledge can be represented as implications B1,…,Bk => A. • Conditional probabilities are a probabilistic version of reasoning about what follows from conditions. • Cognitive Science: Our minds store implicational knowledge.
Independence • A and B are independent iff P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A) P(B) • Suppose that Weather is independent of the Cavity scenario. Then the joint distribution decomposes: P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather) • Absolute independence is powerful but rare.
Exercise • Prove that the three definitions of independence are equivalent (assuming all positive probabilities). • A and B are independent iff • P(A|B) = P(A) • or P(B|A) = P(B) • or P(A, B) = P(A) P(B)
Conditional independence • If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: (1) P(catch | toothache, cavity) = P(catch | cavity) • The same independence holds if I haven't got a cavity: (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity) • Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity) • The equivalences for independence also hold for conditional independence, e.g.: P(Toothache | Catch, Cavity) = P(Toothache | Cavity) P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
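A small numeric check of equality (1), a sketch assuming the standard full joint table for this example (the Toothache = false entries are not on the slides and are the usual textbook values):

```python
# Full joint over (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    return sum(p for a, p in joint.items() if event(*a))

# P(catch | toothache, cavity)
lhs = prob(lambda t, c, cav: c and t and cav) / prob(lambda t, c, cav: t and cav)
# P(catch | cavity)
rhs = prob(lambda t, c, cav: c and cav) / prob(lambda t, c, cav: cav)

print(round(lhs, 3), round(rhs, 3))  # both 0.9: Catch ⊥ Toothache | Cavity in this table
```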
Common Causes: Spot the Pattern • (Diagram: Cavity is the parent of both Catch and Toothache.) • Catch is independent of Toothache given Cavity.
Burglary Example • JohnCalls, MaryCalls are conditionally independent given Alarm.
Spot the Pattern: Chain Scenario • MaryCalls is independent of Burglary given Alarm. • JohnCalls is independent of Earthquake given Alarm.
The Markov Condition • A Bayes net is constructed so that: each variable is conditionally independent of its nondescendants given its parents. • The graph alone (without specified probabilities) entails conditional independencies. • Causal Interpretation: Each parent is a direct cause.
The Chain Rule • We can always write P(a, b, c, …, z) = P(a | b, c, …, z) P(b, c, …, z) (Product Rule) • Repeatedly applying this idea, we obtain P(a, b, c, …, z) = P(a | b, c, …, z) P(b | c, …, z) P(c | …, z) … P(z) • Order the variables such that children come before parents. • Then, given its parents, each node is independent of its other ancestors by the topological independence. • P(a, b, c, …, z) = Πx P(x | parents(x))
Example in Burglary Network • P(M, J, A, E, B) = P(M | J, A, E, B) P(J, A, E, B) = P(M|A) P(J, A, E, B) = P(M|A) P(J | A, E, B) P(A, E, B) = P(M|A) P(J|A) P(A, E, B) = P(M|A) P(J|A) P(A|E,B) P(E, B) = P(M|A) P(J|A) P(A|E,B) P(E) P(B) • (On the slide, colours show applications of the Bayes net topological independence.)
Common Effects: Spot the Pattern • (Diagram: Influenza and Smokes are both parents of Bronchitis.) • Influenza and Smokes are independent. • Given Bronchitis, they become dependent. • (Diagram: Battery Age and Charging System OK are both parents of Battery Voltage.) • Battery Age and Charging System are independent. • Given Battery Voltage, they become dependent.
Conditioning on Children • (Diagram: A and B are both parents of C.) • Independent Causes: A and B are independent. • "Explaining away" effect: Given C, observing A makes B less likely. • E.g. Bronchitis in UBC "Simple Diagnostic Problem". • A and B are (marginally) independent, but become dependent once C is known.
D-separation • A, B, and C are non-intersecting subsets of nodes in a directed graph. • A path from A to B is blocked if it contains a node such that either • the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or • the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C. • If all paths from A to B are blocked, A is said to be d-separated from B by C. • If A is d-separated from B by C, then the joint distribution over all variables in the graph satisfies A ⊥ B | C (A is conditionally independent of B given C).
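A compact Python sketch of this definition for small graphs: it enumerates the undirected paths between the two sets and applies the two blocking rules above. The function and variable names are illustrative, and the brute-force path enumeration is only meant for toy networks such as the burglary example.

```python
def d_separated(parents, A, B, C):
    """Return True if every undirected path from a node in A to a node in B
    is blocked by the conditioning set C (the d-separation criterion)."""
    nodes = set(parents)
    children = {n: {m for m in nodes if n in parents[m]} for n in nodes}

    def descendants(n):
        out, frontier = set(), {n}
        while frontier:
            frontier = set().union(*(children[f] for f in frontier)) - out
            out |= frontier
        return out

    def paths(start, goal, visited):
        """All simple paths from start to goal in the undirected skeleton."""
        if start == goal:
            yield [start]
            return
        for nbr in set(parents[start]) | children[start]:
            if nbr not in visited:
                for rest in paths(nbr, goal, visited | {nbr}):
                    yield [start] + rest

    def blocked(path):
        for prev, node, nxt in zip(path, path[1:], path[2:]):
            head_to_head = prev in parents[node] and nxt in parents[node]
            if head_to_head:
                # collider: blocked if neither the node nor a descendant is in C
                if node not in C and not (descendants(node) & C):
                    return True
            elif node in C:
                # chain (head-to-tail) or fork (tail-to-tail) node is observed
                return True
        return False

    return all(blocked(p) for a in A for b in B for p in paths(a, b, {a}))


# Burglary network: B and E are parents of A; J and M are children of A.
burglary = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(d_separated(burglary, {"J"}, {"M"}, {"A"}))  # True: J ⊥ M | A
print(d_separated(burglary, {"B"}, {"E"}, {"A"}))  # False: explaining away
```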