Solving problems using propositional logic
• Need to write what you know as propositional formulas
• Theorem proving will then tell you whether a given new sentence holds given what you know
• Three kinds of queries:
  • Is my knowledge base consistent? (i.e., is there at least one world where everything I know is true?) This is satisfiability.
  • Is the sentence S entailed by my knowledge base? (i.e., is it true in every world where my knowledge base is true?)
  • Is the sentence S consistent with (possibly true given) my knowledge base? (i.e., is S true in at least one of the worlds where my knowledge base holds?)
    • S is consistent if ~S is not entailed
• But propositional logic cannot differentiate between degrees of likelihood among possible sentences
Example
Pearl lives in Los Angeles. It is a high-crime area, so Pearl installed a burglar alarm. He asked his neighbors John & Mary to call him if they hear the alarm, so that he can come home if there is a burglary. Los Angeles is also earthquake-prone, and the alarm goes off when there is an earthquake.

Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls

If there is a burglary, will Mary call? Check KB & B |= M.
If Mary didn't call, is it possible that Burglary occurred? Check that KB & ~M doesn't entail ~B.
Example (Real)
Pearl lives in Los Angeles. It is a high-crime area, so Pearl installed a burglar alarm. He asked his neighbors John & Mary to call him if they hear the alarm, so that he can come home if there is a burglary. Los Angeles is also earthquake-prone, and the alarm goes off when there is an earthquake.

But Pearl lives in the real world, where (1) burglars can sometimes disable alarms, (2) some earthquakes may be too slight to cause the alarm, (3) even in Los Angeles, burglaries are more likely than earthquakes, (4) John and Mary both have their own lives and may not always call when the alarm goes off, (5) between John and Mary, John is more of a slacker than Mary, and (6) John and Mary may call even without the alarm going off.

Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls

If there is a burglary, will Mary call? Check KB & B |= M.
If Mary didn't call, is it possible that Burglary occurred? Check that KB & ~M doesn't entail ~B.
John already called. If Mary also calls, is it more likely that Burglary occurred?
You now also hear on the TV that there was an earthquake. Is Burglary more or less likely now?
How do we handle Real Pearl?
Omniscient & Eager way: Model everything!
  E.g., model exactly the conditions under which John will call: he shouldn't be listening to loud music, he hasn't gone on an errand, he didn't recently have a tiff with Pearl, etc., etc.
  A & c1 & c2 & c3 & ... & cn => J
  (also, the exceptions may have interactions: c1 & c5 => ~c9)
  The qualification and ramification problems (the "potato in the tail-pipe" problem) make this an infeasible enterprise.
Ignorant (non-omniscient) and Lazy (non-omnipotent) way: Model the likelihood.
  In 85% of the worlds where there was an alarm, John will actually call.
  How do we do this? Non-monotonic logics? "Certainty factors"? "Fuzzy logic"? "Probability" theory?
Non-monotonic (default) logic • Propositional calculus (as well as the first-order logic we shall discuss later) is monotonic, in that once you prove a fact F to be true, no amount of additional knowledge can allow us to disprove F. • But in the real world, we jump to conclusions by default and revise them on additional evidence. • Consider the way the truth of the statement "F: Tweety flies" is revised when we are given facts in sequence: 1. Tweety is a bird (F) 2. Tweety is an ostrich (~F) 3. Tweety is a magical ostrich (F) 4. Tweety was cursed recently (~F) 5. Tweety was able to get rid of the curse (F) • How can we make logic draw this sort of "defeasible" (aka defeatable) conclusion? • Many ideas, with one being negation as failure • Let the rule about birds be Bird & ~abnormal => Fly • The "abnormal" predicate is treated specially: if we can't prove abnormal, we can assume ~abnormal is true • (Note that in normal logic, failure to prove a fact F doesn't allow us to assume that ~F is true, since F may hold in some models and not in others.) • The non-monotonic logic enterprise involves (1) providing clean semantics for this type of reasoning and (2) making defeasible inference efficient
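A minimal Python sketch of negation as failure for the Tweety example; the fact representation and the `provable` helper are illustrative assumptions, not a full non-monotonic logic.

```python
# Minimal negation-as-failure sketch for the "Tweety flies" default.
# Facts and the single default rule are hard-coded for illustration.

facts = {"bird(tweety)"}           # what we currently know

def provable(atom, kb):
    """In this toy setting, an atom is provable iff it is asserted in the KB."""
    return atom in kb

def flies(x, kb):
    """Default rule: Bird(x) & ~abnormal(x) => Fly(x).
    'abnormal' is treated specially: if we cannot prove it, we assume its negation."""
    return provable(f"bird({x})", kb) and not provable(f"abnormal({x})", kb)

print(flies("tweety", facts))      # True: nothing says Tweety is abnormal
facts.add("abnormal(tweety)")      # learn Tweety is an ostrich, hence abnormal
print(flies("tweety", facts))      # False: the default conclusion is retracted
```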
Certainty Factors • Associate numbers with each of the facts/axioms • When you do derivations, compute the c.f. of the result in terms of the c.f.s of its constituents ("truth functional") • Problem: circular reasoning because of mixed causal/diagnostic directions • Raining => Grass-wet (0.9) • Grass-wet => Raining (0.7) • If you know Grass-wet with certainty 0.4, then we conclude Raining, which makes Grass-wet more certain, which…
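A toy illustration of the circular-reasoning problem. The specific update scheme below (a MYCIN-style combination rule) is an assumption made only to show the mechanics; the point is that truth-functional propagation with rules in both directions reinforces itself without any new evidence.

```python
# Toy demonstration: truth-functional certainty factors with rules in both
# causal and diagnostic directions inflate each other without new evidence.
# The MYCIN-style combination rule used here is an illustrative assumption.

def combine(cf_old, cf_evidence):
    return cf_old + cf_evidence * (1 - cf_old)

cf_wet, cf_rain = 0.4, 0.0          # only observation: Grass-wet with cf 0.4
for step in range(5):
    cf_rain = combine(cf_rain, 0.7 * cf_wet)   # Grass-wet => Raining (0.7)
    cf_wet  = combine(cf_wet,  0.9 * cf_rain)  # Raining => Grass-wet (0.9)
    print(f"step {step}: cf(wet)={cf_wet:.3f}  cf(rain)={cf_rain:.3f}")
# Both certainties keep climbing even though no new evidence arrived.
```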
Fuzzy Logic vs. Prob. Prop. Logic
• Fuzzy logic assumes that the world is made of statements that have different grades of truth (recall the puppy example).
• Fuzzy logic is "truth functional": it assumes that the truth value of a sentence can be established in terms of the truth values of the constituent elements of that sentence alone.
• Probabilistic propositional logic (PPL) assumes that the world is made up of statements that are either true or false.
• PPL is truth functional for "truth value in a given world" but not truth functional for entailment status.
Prop. Logic vs. Prob. Prop. Logic
• Model theory for logic
  • The set of all worlds where the formula/KB holds
  • Each world is either a model or not
• Proof theory/inference
  • Rather than enumerate all models, write KB sentences that constrain the models
• Model theory for prob. logic
  • For each world, give the probability p that the formula holds in that world
  • p(w) = 0 means that world is definitely not a model
  • Otherwise, 0 < p(w) <= 1
  • Sum of p(w) = 1
  • AKA the Joint Probability Distribution: 2^n - 1 numbers
• Proof theory
  • Statements on subsets of propositions, e.g. P(A) = 0.5; P(B|A) = 0.7, etc.
  • (Conditional) independences…
  • These constrain the joint probability distribution
Easy Special Cases
• If there are no relations between the propositions (i.e., they can take values independently of each other), then the joint probability distribution can be specified in terms of the probabilities of each proposition being true: just n numbers instead of 2^n.
• If, in addition, each proposition is equally likely to be true or false, then the joint probability distribution can be specified without giving any numbers! All worlds are equally probable! If there are n propositions, each world has probability 1/2^n, and any propositional conjunction over m (< n) propositions has probability 1/2^m.
Is this a good world to live in?
Will we always need 2^n numbers?
• If the variables are all independent of each other, then P(x_1, x_2, …, x_n) = P(x_1) * P(x_2) * … * P(x_n). Need just n numbers!
  • But if our world were that simple, it would also be very uninteresting and uncontrollable (nothing is correlated with anything else!)
• We need 2^n numbers if every subset of our n variables can be correlated: P(x_1, x_2, …, x_n) = P(x_n | x_1 … x_{n-1}) * P(x_{n-1} | x_1 … x_{n-2}) * … * P(x_1).
  • But that is too pessimistic an assumption about the world; if our world were so interconnected, we would have been dead long back…
• A more realistic middle ground is that interactions between variables are confined to regions, e.g. the "school variables" and the "home variables" interact only loosely (are independent for most practical purposes). We will wind up needing O(2^k) numbers (k << n).
Probabilistic Calculus to the Rescue
Suppose we know the likelihood of each of the (propositional) worlds (aka the joint probability distribution). Then we can use the standard rules of probability to compute the likelihood of all queries (as I will remind you). So, the joint probability distribution is all that you ever need!
In the case of the Pearl example, we just need the joint probability distribution over B, E, A, J, M (32 numbers); in general, 2^n separate numbers (which should add up to 1).
If the joint distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?
-- Answer: Indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers.
-- The local relations between propositions can be seen as "constraining" the form the joint probability distribution can take!

Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls

The topology encodes the conditional independence assertions: only 10 (instead of 32) numbers to specify!
If you know the full joint, you can answer ANY query.
Computing Prob. Queries Given a Joint Distribution
Joint distribution over Cavity (CA) and Toothache (TA):
  P(CA, TA) = 0.04    P(CA, ~TA) = 0.06
  P(~CA, TA) = 0.01   P(~CA, ~TA) = 0.89
Queries:
  P(CA & TA) = ?
  P(CA) = ?
  P(TA) = ?
  P(CA V TA) = ?
  P(CA | ~TA) = ?
  P(~TA | CA) = ?
Check that P(CA|~TA) = P(~TA|CA) * P(CA) / P(~TA)   (Bayes rule)
Answers:
  P(CA & TA) = 0.04
  P(CA) = 0.04 + 0.06 = 0.1 (marginalizing over TA)
  P(TA) = 0.04 + 0.01 = 0.05
  P(CA V TA) = P(CA) + P(TA) - P(CA & TA) = 0.1 + 0.05 - 0.04 = 0.11
  P(CA | ~TA) = P(CA & ~TA) / P(~TA) = 0.06 / (0.06 + 0.89) = 0.06 / 0.95 ≈ 0.063
Think of this as analogous to entailment by truth-table enumeration!
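A small Python sketch that answers these queries directly from the four joint-table entries above, the probabilistic analogue of truth-table enumeration.

```python
# Answer queries by summing entries of the full joint distribution.
# The four numbers below are the joint table from the slide.
joint = {
    (True,  True):  0.04,   # (Cavity, Toothache)
    (True,  False): 0.06,
    (False, True):  0.01,
    (False, False): 0.89,
}

def p(pred):
    """Probability of an event = sum of the joint entries where it holds."""
    return sum(pr for (ca, ta), pr in joint.items() if pred(ca, ta))

p_ca_and_ta = p(lambda ca, ta: ca and ta)            # 0.04
p_ca        = p(lambda ca, ta: ca)                   # 0.10 (marginalize over TA)
p_ta        = p(lambda ca, ta: ta)                   # 0.05
p_ca_or_ta  = p(lambda ca, ta: ca or ta)             # 0.11
p_ca_given_not_ta = p(lambda ca, ta: ca and not ta) / p(lambda ca, ta: not ta)
print(p_ca_and_ta, p_ca, p_ta, p_ca_or_ta, round(p_ca_given_not_ta, 3))
```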
(The material in this slide was presented on the white board.)
If B => A, then:
  P(A|B) = ?
  P(B|~A) = ?
  P(B|A) = ?
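The whiteboard answers are not in the slide text; a brief worked version follows, assuming P(B) > 0, P(A) > 0, and P(~A) > 0 so that the conditionals are defined.

```latex
% If B => A holds in every world, then every world with B also has A.
P(A \mid B) = \frac{P(A \wedge B)}{P(B)} = \frac{P(B)}{P(B)} = 1
\qquad
P(B \mid \neg A) = \frac{P(B \wedge \neg A)}{P(\neg A)} = \frac{0}{P(\neg A)} = 0
\qquad
P(B \mid A) = \frac{P(A \wedge B)}{P(A)} = \frac{P(B)}{P(A)} \in (0, 1]
% The third quantity is not fixed by the implication alone.
```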
Most useful probabilistic reasoning involves computing posterior distributions:
  P(CA)  then  P(CA | TA)  then  P(CA | TA; Catch)  then  P(CA | TA; Catch; Romney-won)?
[Figure: a posterior distribution, probability vs. variable values]
Important: Computing a posterior distribution is inference, not learning.
(The material in this slide was presented on the white board.)
CONDITIONAL PROBABILITIES
Non-monotonicity w.r.t. evidence: P(A|B) can be higher than, lower than, or equal to P(A).
(The material in this slide was presented on the white board.)
Generalized Bayes rule:
  P(A|B,e) = P(B|A,e) * P(A|e) / P(B|e)
This lets us get by with easier-to-assess numbers. Think of it as analogous to inference rules (like modus ponens).
Example: let A be Anthrax and Rn be Runny Nose. Then P(A|Rn) = P(Rn|A) * P(A) / P(Rn).
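A quick numeric illustration of why this form is convenient; the numbers below are hypothetical, chosen only to show that a highly diagnostic-looking symptom can still leave the posterior tiny when the prior is tiny.

```latex
% Hypothetical numbers: P(Rn \mid A) = 0.9,\quad P(A) = 10^{-5},\quad P(Rn) = 0.1
P(A \mid Rn) \;=\; \frac{P(Rn \mid A)\,P(A)}{P(Rn)}
            \;=\; \frac{0.9 \times 10^{-5}}{0.1}
            \;=\; 9 \times 10^{-5}
```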
Relative ease/utility of assessing various types of probabilities
(The material in this slide was presented on the white board.)
• The joint distribution requires us to assess probabilities of the form P(x1, ~x2, x3, …, ~xn). This means we have to look at all entities in the world and see what fraction of them have x1, ~x2, x3, …, ~xn true: a difficult experiment to set up.
• Conditional probabilities of the form P(A|B) are relatively easier to assess: you just need to look at the set of entities having B true, and find the fraction of them that also have A true. (Eventually, they too can get baroque: P(x1, ~x2, …, xm | y1, …, yn).)
• Among conditional probabilities, causal probabilities of the form P(effect|cause) are better to assess than diagnostic probabilities of the form P(cause|effect).
  • Causal probabilities tend to be more stable than diagnostic probabilities. (For example, a textbook in dentistry can publish P(TA|Cavity) and hope that it will hold in a variety of places. In contrast, P(Cavity|TA) may depend on other fortuitous factors: in areas where people tend to eat a lot of ice cream, many toothaches may be prevalent, and few of them may actually be due to cavities.)
"Doc, doc, I have flu. Can you tell if I have a runny nose?"
(The material in this slide was presented on the white board. Need to know this!)
What happens if there are multiple symptoms?
• A patient walked in and complained of a toothache; you assess P(Cavity|Toothache).
• Now you probe the patient's mouth with that steel thingie, and it catches… How do we update our belief in Cavity?
  P(Cavity | TA, Catch) = P(TA, Catch | Cavity) * P(Cavity) / P(TA, Catch)
                        = a * P(TA, Catch | Cavity) * P(Cavity)
• If there are n evidence variables, we will need 2^n probabilities!
• Conditional independence to the rescue: suppose P(TA, Catch | Cavity) = P(TA | Cavity) * P(Catch | Cavity).
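A small sketch of the update using the conditional-independence assumption; all the numbers below are hypothetical, picked only to show the mechanics of normalizing with the constant a.

```python
# Posterior over Cavity given Toothache and Catch, assuming
# P(TA, Catch | Cavity) = P(TA | Cavity) * P(Catch | Cavity).
# All numbers are hypothetical, for illustration only.
p_cavity = 0.2
p_ta_given    = {True: 0.6, False: 0.1}     # P(TA | Cavity = c)
p_catch_given = {True: 0.9, False: 0.2}     # P(Catch | Cavity = c)

unnorm = {c: p_ta_given[c] * p_catch_given[c] * (p_cavity if c else 1 - p_cavity)
          for c in (True, False)}
alpha = 1.0 / sum(unnorm.values())          # the normalization constant "a"
posterior = {c: alpha * v for c, v in unnorm.items()}
print(round(posterior[True], 3))            # P(Cavity | TA, Catch) ≈ 0.871
```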
Local semantics give global semantics (i.e., the full joint)
Put the variables in reverse topological order and apply the chain rule:
  P(J, M, A, ~B, ~E) = P(J | M, A, ~B, ~E) * P(M | A, ~B, ~E) * P(A | ~B, ~E) * P(~B | ~E) * P(~E)
By the conditional independence inherent in the Bayes net (local semantics: a node is independent of its non-descendants given its parents):
                     = P(J | A) * P(M | A) * P(A | ~B, ~E) * P(~B) * P(~E)
                     = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.000628
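A short sketch that reproduces this computation; the five CPT entries are just the numbers shown on the slide.

```python
# One entry of the full joint for the alarm network, computed from CPT
# entries via the chain rule plus the Bayes-net independence assumptions.
p_j_given_a   = 0.9     # P(J | A)
p_m_given_a   = 0.7     # P(M | A)
p_a_given_nbe = 0.001   # P(A | ~B, ~E)
p_not_b       = 0.999   # P(~B)
p_not_e       = 0.998   # P(~E)

p_joint_entry = p_j_given_a * p_m_given_a * p_a_given_nbe * p_not_b * p_not_e
print(round(p_joint_entry, 6))   # ≈ 0.000628, matching the slide
```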
Conditional Independence Assertions • We write X || Y | Z to say that the set of variables X is conditionally independent of the set of variables Y given evidence on the set of variables Z (where X,Y,Z are subsets of the set of all random variables in the domain model) • Inference can exploit conditional independence assertions. Specifically, • X || Y| Z implies • P(X & Y|Z) = P(X|Z) * P(Y|Z) • P(X|Y, Z) = P(X|Z) • P(Y|X,Z) = P(Y|Z) • If A||B|C then P(A,B,C)=P(A|B,C)P(B,C) =P(A|B,C)P(B|C)P(C) =P(A|C)P(B|C)P(C) (Can get by with 1+2+2=5 numbers instead of 8) Why not write down all conditional independence assertions that hold in a domain?
Cond. Indep. Assertions (Contd) • Idea: Why not write down all conditional independence assertions (CIA) (X || Y | Z) that hold in a domain? • Problem: There can be exponentially many conditional independence assertions that hold in a domain (recall that X, Y and Z are all subsets of the domain variables). • Brilliant Idea: Maybe we should implicitly specify the CIA by writing down the "local dependencies" between variables using a graphical model • A Bayes network is a way of doing just this. • The Bayes net is a directed acyclic graph whose nodes are random variables, and the immediate dependencies between variables are represented by directed arcs • The topology of a Bayes network shows the inter-variable dependencies. Given the topology, there is a way of checking whether any conditional independence assertion holds in the network (the Bayes Ball algorithm and the D-Sep idea)
Topological Semantics
• Independence from non-descendants holds given just the parents.
• Independence from every (other) node holds given the Markov blanket: parents, children, and children's other parents.
These two conditions are equivalent; many other conditional independence assertions follow from them.
CIA implicit in Bayes Nets • So, what conditional independence assumptions are implicit in Bayes nets? • Local Markov Assumption: • A node N is independent of its non-descendants (including ancestors) given its immediate parents. (So if P are the immediate parents of N, and A is a subset of the ancestors and other non-descendants, then {N} || A | P.) • (Equivalently) A node N is independent of all other nodes given its Markov blanket (parents, children, children's parents) • Given this assumption, many other conditional independencies follow. For a full answer, we need to appeal to the D-Sep condition and/or Bayes Ball reachability
[Figure: the alarm network (Burglary, Earthquake, Alarm, John-calls, Mary-calls) rebuilt with a non-causal variable ordering]
Introduce variables in the causal order: easy when you know the causality of the domain, hard otherwise…
With a non-causal ordering we must answer questions like: is P(A | J, M) = P(A)?
How many probabilities are needed? 13 for the new network; 10 for the old. Is this the worst?
Bayesian (Personal) Probabilities
Continuing bad friends: in the question above, suppose a second friend comes along and says that he can give you the conditional probabilities you want to complete the specification of your Bayes net. You ask him for a CPT entry, and pat comes a response, some number between 0 and 1. This friend is well meaning, but you are worried that the numbers he is giving may lead to some sort of inconsistent joint probability distribution. Is your worry justified (i.e., can your friend give you numbers that lead to an inconsistency)?
(To understand "inconsistency", consider someone who insists on giving you P(A), P(B), P(A&B) as well as P(AVB), and the numbers wind up not satisfying P(AVB) = P(A) + P(B) - P(A&B); or, alternately, they insist on giving you P(A|B), P(B|A), P(A) and P(B), and the four numbers don't satisfy Bayes rule.)
Answer: No. As long as we only ask the friend to fill in the CPTs of the Bayes network, there is no way the numbers won't make up a consistent joint probability distribution. This should be seen as a feature.
Personal probabilities: John may be an optimist and believe that P(burglary) = 0.01, and Tom may be a pessimist and believe that P(burglary) = 0.99. Bayesians consider both John and Tom to be fine (they don't insist on an objective frequentist interpretation for probabilities). However, Bayesians do think that John and Tom should act consistently with their own beliefs: for example, it makes no sense for John to go about installing tons of burglar alarms given his belief, just as it makes no sense for Tom to put all his valuables on his lawn.
Ideas for reducing the number of probabilities to be specified
• Problem 1: The joint distribution requires 2^n numbers to specify, and those numbers are harder to assess.
  Solution: Use Bayes nets to reduce the numbers, and specify them as CPTs.
• Problem 2: But CPTs will be as big as the full joint if the network is dense.
  Solution: Introduce intermediate variables to induce sparsity into the network.
• Problem 3: But CPTs can still be quite hard to specify if there are too many parents (or if the variables are continuous).
  Solution: Parameterize the CPT (use noisy-OR etc. for discrete variables; Gaussians etc. for continuous variables).
Making the network sparse by introducing intermediate variables • Consider a network of boolean variables where n parent nodes are connected to m children nodes (with each parent influencing each child). • You will need n + m*2^n conditional probabilities. • Suppose you realize that what is really influencing the child nodes is some single aggregate function of the parents' values (e.g., the sum of the parents). • We can introduce a single intermediate node called "sum", which has links from all n parent nodes and separately influences each of the m child nodes. • Now you will wind up needing only n + 2^n + 2m conditional probabilities to specify this new network! (A worked count follows below.)
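A quick sanity check of the two counts, using an illustrative size of n = 10 parents and m = 5 children (the sizes themselves are arbitrary assumptions).

```python
# Parameter counts for the dense vs. "sum"-mediated network (boolean variables).
n, m = 10, 5                       # illustrative sizes

dense  = n + m * 2**n              # n parent priors + one 2^n-row CPT per child
sparse = n + 2**n + 2 * m          # parent priors + CPT for "sum" + 2 rows per child
print(dense, sparse)               # 5130 vs. 1044
```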
Learning such hidden variables from data poses challenges..
Compact/Parameterized distributions are pretty much the only way to go when continuous variables are involved!
Noisy-OR
• r_i = the probability that X fails to hold even though the i-th parent holds (its "failure to cause" probability).
• We only consider the failure-to-cause probabilities of the causes that hold: if parents U_{j+1}, …, U_k hold (and the rest do not), then P(~X | U_{j+1}, …, U_k) = ∏_{i=j+1}^{k} r_i.
• Think of a firing squad with up to k gunners trying to shoot you: you will live only if everyone who shoots misses.
• How about Noisy-AND? (hint: A & B => ~(~A V ~B))
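A sketch of how a noisy-OR CPT could be generated from per-cause failure probabilities; the three r values below are hypothetical, and no "leak" term is modeled.

```python
from itertools import product

# Noisy-OR: P(~X | parent assignment) = product of r_i over the parents that hold.
# r[i] = probability that X fails even though parent i holds (hypothetical values).
r = [0.4, 0.2, 0.1]          # failure-to-cause probabilities for 3 parents
k = len(r)

for assignment in product([False, True], repeat=k):
    p_not_x = 1.0
    for holds, ri in zip(assignment, r):
        if holds:                       # only causes that hold can be "inhibited"
            p_not_x *= ri
    print(assignment, "P(X) =", round(1 - p_not_x, 4))
# Only k numbers (the r_i) specify a CPT that would otherwise need 2^k rows.
```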
Constructing Belief Networks: Summary • Decide on what sorts of queries you are interested in answering • This in turn dictates what factors to model in the network • Decide on a vocabulary of the variables and their domains for the problem • Introduce "hidden" variables into the network as needed to make the network sparse • Decide on an order of introduction of variables into the network • Introducing variables in the causal direction leads to fewer connections (a sparse structure) AND easier-to-assess probabilities • Try to use canonical distributions to specify the CPTs • Noisy-OR • Parameterized discrete/continuous distributions, such as Poisson, Normal (Gaussian), etc.
Case Study: Pathfinder System • Domain: lymph node diseases • Deals with 60 diseases and 100 disease findings • Versions: • Pathfinder I: a rule-based system with logical reasoning • Pathfinder II: tried a variety of approaches for uncertainty; simple Bayes reasoning outperformed the rest • Pathfinder III: simple Bayes reasoning, but with reassessed probabilities • Pathfinder IV: a Bayesian network was used to handle a variety of conditional dependencies • Deciding the vocabulary: 8 hours • Devising the topology of the network: 35 hours • Assessing the (14,000) probabilities: 40 hours • Physician experts liked assessing causal probabilities • Evaluation: 53 "referral" cases • Pathfinder III: 7.9/10 • Pathfinder IV: 8.9/10 [saves one additional life in every 1000 cases!] • A more recent comparison shows that Pathfinder now outperforms the experts who helped design it!
Independence in Bayes Networks: Causal Chains, Common Causes, Common Effects
• Causal chain (linear): X causes Y through Z; the path is blocked if Z is given.
• Common cause (diverging): X and Y are caused by Z; the path is blocked if Z is given.
• Common effect (converging): X and Y cause Z; the path is blocked only if neither Z nor any of its descendants are given.
D-Sep (Direction-dependent Separation)
• X || Y | E if every undirected path from X to Y is blocked by E.
• A path is blocked if there is a node Z on the path such that:
  • Z is in E and Z has one path arrow coming in and the other going out, or
  • Z is in E and Z has both path arrows going out, or
  • Both path arrows lead into Z, and neither Z nor any of its descendants are in E.
Exercises (on the alarm network): B || M | A?   (J,M) || E | A?   B || E?   B || E | A?   B || E | M?
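A sketch of a d-separation checker, using the standard "moral ancestral graph" construction rather than the Bayes Ball traversal mentioned in the slides: X || Y | Z holds in the DAG iff X and Y are disconnected once we moralize the ancestral subgraph of X, Y, Z and delete Z. The alarm-network edges at the bottom are from the slides; the function itself is a generic sketch.

```python
from collections import deque

def ancestors(nodes, parents):
    """All ancestors of `nodes`, including the nodes themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(X, Y, Z, parents):
    X, Y, Z = set(X), set(Y), set(Z)
    anc = ancestors(X | Y | Z, parents)
    # Moralize: undirected parent-child edges plus edges between co-parents.
    adj = {n: set() for n in anc}
    for child in anc:
        ps = [p for p in parents.get(child, []) if p in anc]
        for p in ps:
            adj[child].add(p); adj[p].add(child)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # Remove the conditioning nodes, then check reachability from X to Y.
    frontier, seen = deque(X - Z), set(X - Z)
    while frontier:
        n = frontier.popleft()
        if n in Y:
            return False                      # connected => not d-separated
        for m in adj[n] - Z:
            if m not in seen:
                seen.add(m); frontier.append(m)
    return True

# Alarm network: B -> A <- E, A -> J, A -> M
parents = {"A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(d_separated({"B"}, {"M"}, {"A"}, parents))   # True:  B || M | A
print(d_separated({"B"}, {"E"}, set(), parents))   # True:  B || E
print(d_separated({"B"}, {"E"}, {"A"}, parents))   # False: B, E dependent given A
```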
Topological Semantics
• Independence from non-descendants holds given just the parents.
• Independence from every (other) node holds given the Markov blanket: parents, children, and children's other parents.
These two conditions are equivalent; many other conditional independence assertions follow from them. Convince yourself that these conditions are special cases of D-Sep.
The "Asia" network
[Figure: nodes Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Abnormality in Chest (A), Bronchitis (B), X-Ray (X), Dyspnea (D)]
Exercises: V || A | T?   T || L?   T || L | D?   X || D?   X || D | A?   (V,S) || X | A?   (V,S) || (X,D) | A?
How many probabilities are needed? (Assume all variables are boolean.)
P(V, ~T, S, ~L, A, B, ~X, ~D) = ?
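A worked count, assuming the standard Asia topology (V→T, S→L, S→B, (T,L)→A, A→X, (A,B)→D); the figure itself is not recoverable from the slide text, so treat the structure below as an assumption.

```latex
% Rows needed per CPT (boolean variables): 2^{\#parents}
\underbrace{1}_{P(V)} + \underbrace{1}_{P(S)} + \underbrace{2}_{P(T\mid V)} + \underbrace{2}_{P(L\mid S)}
+ \underbrace{2}_{P(B\mid S)} + \underbrace{4}_{P(A\mid T,L)} + \underbrace{2}_{P(X\mid A)} + \underbrace{4}_{P(D\mid A,B)} = 18
\\[4pt]
P(V,\neg T,S,\neg L,A,B,\neg X,\neg D)
= P(V)\,P(S)\,P(\neg T\mid V)\,P(\neg L\mid S)\,P(B\mid S)\,P(A\mid \neg T,\neg L)\,P(\neg X\mid A)\,P(\neg D\mid A,B)
```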
0th idea for Bayes Net Inference • Given a Bayes net, we can compute all the entries of the joint distribution (by just multiplying entries in the CPTs) • Given the joint distribution, we can answer any probabilistic query. • Ergo, we can do inference on Bayes networks • Qn: Can we do better? • Ideas: • Implicitly enumerate only the part of the joint that is needed • Use sampling techniques to compute the probabilities (a small sampling sketch follows below)
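As a sketch of the sampling idea, here is rejection sampling for P(B | J, M) on the alarm network. Only a few CPT entries appear on these slides (0.001, 0.002, 0.9, 0.7, and P(A|~B,~E) = 0.001); the remaining numbers below are the usual textbook values and should be read as assumptions.

```python
import random

# Rejection sampling for P(Burglary | JohnCalls, MaryCalls) on the alarm net.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def sample_world():
    """Sample all variables in topological order from their CPTs."""
    b = random.random() < P_B
    e = random.random() < P_E
    a = random.random() < P_A[(b, e)]
    j = random.random() < P_J[a]
    m = random.random() < P_M[a]
    return b, e, a, j, m

def estimate_b_given_jm(n=2_000_000):
    kept = hits = 0
    for _ in range(n):
        b, e, a, j, m = sample_world()
        if j and m:                 # keep only samples consistent with the evidence
            kept += 1
            hits += b
    return hits / kept if kept else float("nan")

print(estimate_b_given_jm())        # a noisy estimate, ≈ 0.28 with the assumed CPTs
```

Note that the evidence (J and M both true) is rare under these numbers, so most samples are rejected; that inefficiency is exactly what the smarter inference methods hinted at above try to avoid.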