This chapter explores Bayesianism versus Frequentism in knowledge representation and reasoning. It discusses classical probability and Bayesian probability, as well as Bayesian knowledge representation and reasoning. It also delves into Bayesian terminology and the application of Bayes' rule in spam recognition.
Bayesian approaches to knowledge representation and reasoning, Part 1 (Chapter 13)
Bayesianism vs. Frequentism • Classical probability: Frequentists • Probability of a particular event is defined relative to its frequency in a sample space of events. • E.g., probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses.
Bayesian probability: • Combine measure of “prior” belief you have in a proposition with your subsequent observations of events. • Example: Bayesian can assign probability to statement “The first e-mail message ever written was not spam” but frequentist cannot.
Bayesian Knowledge Representation and Reasoning • Question: Given the data D and our prior beliefs, what is the probability that h is the correct hypothesis? (spam example)
Bayesian terminology (example: spam recognition) • Random variable X: returns one of a set of values {x1, x2, ..., xm}, or a continuous value in interval [a, b], with probability distribution D(X). • Data D: {v1, v2, v3, ...}, the set of observed values of random variables X1, X2, X3, ...
Hypothesis h: Function taking instance j and returning classification of j (e.g., “spam” or “not spam”). • Space of hypotheses H: Set of all possible hypotheses
Prior probability of h: • P(h): Probability that hypothesis h is true given our prior knowledge • If no prior knowledge, all h ∈ H are equally probable • Posterior probability of h: • P(h | D): Probability that hypothesis h is true, given the data D. • Likelihood of D: • P(D | h): Probability that we will see data D, given hypothesis h is true.
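Recall definition of conditional probability: P(X | Y) = P(X ∩ Y) / P(Y) • Event space = all e-mail messages • X = all spam messages • Y = all messages containing word “v1agra” • [Figure: events X and Y overlapping inside the event space]
Bayes Rule: P(X | Y) = P(Y | X) P(X) / P(Y) • [Figure: events X and Y overlapping inside the event space]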
Example: Using Bayes Rule • Hypotheses: h = “message m is spam”, ¬h = “message m is not spam” • Data: + = message m contains “viagra”, – = message m does not contain “viagra” • Prior probability: P(h) = 0.1, P(¬h) = 0.9 • Likelihood: P(+ | h) = 0.6, P(– | h) = 0.4, P(+ | ¬h) = 0.03, P(– | ¬h) = 0.97
P(+) = P(+ | h) P(h) + P(+ | ¬h) P(¬h) = 0.6 * 0.1 + 0.03 * 0.9 = 0.087 • P(–) = 1 – P(+) = 0.913 • P(h | +) = P(+ | h) P(h) / P(+) = 0.6 * 0.1 / 0.087 ≈ 0.69 • How would we learn these prior probabilities and likelihoods from past examples of spam and not spam?
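The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior` and its argument names are illustrative and not part of the slides. Carried through without rounding the intermediate value, it gives P(+) = 0.087 and P(h | +) of about 0.69.

```python
def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """Return (P(data), P(h | data)) for a two-hypothesis problem.

    P(data) comes from the law of total probability over h and not-h;
    P(h | data) then follows from Bayes' rule.
    """
    p_data = p_data_given_h * prior_h + p_data_given_not_h * (1.0 - prior_h)
    return p_data, p_data_given_h * prior_h / p_data


# Numbers from the slide: P(h) = 0.1, P(+ | h) = 0.6, P(+ | not h) = 0.03
p_plus, p_spam_given_plus = posterior(0.1, 0.6, 0.03)
print(round(p_plus, 3), round(p_spam_given_plus, 2))
```

Even though the word is 20 times more likely in spam than in non-spam, the low prior P(h) = 0.1 keeps the posterior well below 1.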
Full joint probability distribution (corrected) • Notation: P(h, D) ≡ P(h ∧ D) • P(h ∧ +) = P(h | +) P(+) • P(h ∧ –) = P(h | –) P(–) • etc.
Now suppose there is a second feature examined: does message contain the word “offer”? • The full joint distribution is now P(m = spam, viagra, offer). • The size of the full joint distribution grows exponentially with the number of features.
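Bayes optimal classifier for spam: class(m) = argmax_h P(h | f1, f2, ..., fn) = argmax_h P(f1, f2, ..., fn | h) P(h), for h ∈ {spam, ¬spam}, where fi is a feature (here, could be a “keyword”) • In general, intractable.
Classification using “naive Bayes” • Assumes that all features are conditionally independent of one another, given the class: P(f1, f2, ..., fn | h) = ∏_i P(fi | h), so class(m) = argmax_h P(h) ∏_i P(fi | h). • How do we learn the naive Bayes model from data? • How do we apply the naive Bayes model to a new instance?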
Example: Training and Using Naive Bayes for Classification • Features: • CAPS: Boolean (longest contiguous string of capitalized letters in message is longer than 3) • URL: Boolean (0 if no URL in message, 1 if at least one URL in message) • $: Boolean (0 if $ does not appear at least once in message; 1 otherwise)
Training data: • M1: “DON’T MISS THIS AMAZING OFFER $$$!” (spam) • M2: “Dear mm, for more $$, check this out: http://www.spam.com” (spam) • M3: “I plan to offer two sections of CS 250 next year” (not spam) • M4: “Hi Mom, I am a bit short on $$ right now, can you send some ASAP? Love, me” (not spam)
Training a Naive Bayes Classifier • Two hypotheses: spam or not spam • Estimate: P(spam) = .5, P(¬spam) = .5 • P(CAPS | spam) = .5, P(¬CAPS | spam) = .5 • P(URL | spam) = .5, P(¬URL | spam) = .5 • P($ | spam) = .75, P(¬$ | spam) = .25 • P(CAPS | ¬spam) = .5, P(¬CAPS | ¬spam) = .5 • P(URL | ¬spam) = .25, P(¬URL | ¬spam) = .75 • P($ | ¬spam) = .5, P(¬$ | ¬spam) = .5 • Note: the conditional estimates are smoothed counts, (n_c + 1) / (n + 2); e.g., both spam messages contain $, so P($ | spam) = (2 + 1) / (2 + 2) = .75 rather than 1.
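m-estimate of probability (to fix cases where one of the terms in the product is 0): P(fi | h) ≈ (n_c + m p) / (n + m), where n is the number of training examples of class h, n_c is the number of those examples in which feature fi holds, p is a prior estimate of the probability, and m is the “equivalent sample size” that weights the prior.
Now classify new message: M5: “This is a ONE-TIME-ONLY offer that will get you BIG $$$, just click on http://www.spammers.com”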
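The training and classification steps for this example can be sketched in Python as follows. This is one possible reading of the slides, not a definitive implementation. The assumptions are labeled: the regular expression used for CAPS, detecting a URL by its "http://" prefix, and the m-estimate settings m = 2, p = 0.5 (chosen because they reproduce the estimates listed on the training slide). The names `features`, `train`, and `classify` are illustrative.

```python
import re

FEATURES = ("CAPS", "URL", "$")


def features(text):
    """Map a message to the three Boolean features from the slides."""
    return {
        # Assumption: "longer than 3" means a run of 4 or more capital letters.
        "CAPS": re.search(r"[A-Z]{4,}", text) is not None,
        # Assumption: a URL is recognized by its "http://" prefix.
        "URL": "http://" in text,
        "$": "$" in text,
    }


TRAINING = [
    ("DON'T MISS THIS AMAZING OFFER $$$!", "spam"),
    ("Dear mm, for more $$, check this out: http://www.spam.com", "spam"),
    ("I plan to offer two sections of CS 250 next year", "not spam"),
    ("Hi Mom, I am a bit short on $$ right now, can you send some ASAP? "
     "Love, me", "not spam"),
]


def train(examples, m=2, p=0.5):
    """Estimate P(class) and m-estimates of P(feature is true | class)."""
    model = {}
    for label in sorted({lab for _, lab in examples}):
        rows = [features(text) for text, lab in examples if lab == label]
        n = len(rows)
        model[label] = {
            "prior": n / len(examples),
            "likelihood": {
                f: (sum(row[f] for row in rows) + m * p) / (n + m)
                for f in FEATURES
            },
        }
    return model


def classify(model, text):
    """Return the most probable label and the normalized posteriors."""
    x = features(text)
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for f, p_true in params["likelihood"].items():
            # Multiply by P(f | class) or P(not f | class) as observed.
            score *= p_true if x[f] else 1.0 - p_true
        scores[label] = score
    total = sum(scores.values())
    posteriors = {label: s / total for label, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


model = train(TRAINING)
label, posteriors = classify(
    model,
    "This is a ONE-TIME-ONLY offer that will get you BIG $$$, "
    "just click on http://www.spammers.com",
)
print(label, posteriors)
```

Under these assumptions the new message has CAPS, URL, and $ all true, so the unnormalized scores are .5 × .5 × .5 × .75 for spam and .5 × .5 × .25 × .5 for not spam, and the message is classified as spam.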
Information Retrieval • Most important concepts: • Defining features of a document • Indexing documents according to features • Retrieving documents in response to a query • Ordering retrieved documents by relevance • Early search engines: • Features: List of all terms (keywords) in document (minus “a”, “the”, etc.) • Indexing: by keyword • Retrieval: by keyword match with query • Ordering: by number of keywords matched • Problems with this approach
Naive Bayesian Document retrieval • Let D be a document (“bag of words”), Q be a query (“bag of words”), and r be the event that D is relevant to Q. • In document retrieval, we want to compute: P(r | D, Q) • Or, “odds ratio”: P(r | D, Q) / P(¬r | D, Q) • In the book, they show (via a lot of algebra) that P(r | D, Q) / P(¬r | D, Q) = [P(Q | D, r) / P(Q | D, ¬r)] × [P(r | D) / P(¬r | D)] • Chain rule: P(A, B) = P(A | B) P(B)
Naive Bayesian Document retrieval • P(Q | D, r) = ∏_j P(Qj | D, r), where Qj is the jth keyword in the query. • The probability of a query given a relevant document D is estimated as the product of the probabilities of each keyword in the query, given the relevant document. • How to learn these probabilities?
Evaluating Information Retrieval Systems • Precision and Recall • Example: Out of corpus of 100 documents, query has following results: 40 documents in the results set, of which 30 are relevant and 10 are not; 60 documents not in the results set, of which 20 are relevant and 40 are not. • Precision: Fraction of documents in the results set that are relevant = 30/40 = .75 “How precise is results set?” • Recall: Fraction of relevant documents in whole corpus that are in results set = 30/50 = .60 “How many relevant documents were recalled?”
Tradeoff between recall and precision: • If we want to ensure that recall is high, just retrieve a lot of documents. Then precision may be low. If we retrieve 100% of the documents, but only 50% are relevant, then recall is 1, but precision is 0.5. • If we want a high chance of high precision, just retrieve the single document judged most relevant (“I’m feeling lucky” in Google). Then precision will (likely) be 1.0, but recall will be low. • When do you want high precision? When do you want high recall?
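The precision and recall figures above, and the retrieve-everything case, can be reproduced with a short Python sketch. The document ids are hypothetical: they are chosen only so that 50 of 100 documents are relevant, 40 are retrieved, and 30 are both. The function name `precision_recall` is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus of 100 document ids, 0..99.
relevant = set(range(50))        # 50 relevant documents
retrieved = set(range(20, 60))   # 40 retrieved, 30 of them relevant

print(precision_recall(retrieved, relevant))    # query from the example
print(precision_recall(range(100), relevant))   # retrieve the whole corpus
```

Retrieving the whole corpus drives recall to 1 while precision falls to the base rate of relevant documents, which is the tradeoff described above.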
Bayesian approaches to knowledge representation and reasoning, Part 2 (Chapter 14, sections 1-4)
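Recall Naive Bayes method: P(h | f1, ..., fn) = α P(h) ∏_i P(fi | h), where α is a normalization constant • This can also be written in terms of “cause” and “effect”: P(Cause, Effect1, ..., Effectn) = P(Cause) ∏_i P(Effecti | Cause)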
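[Figure: Naive Bayes network, with cause Spam and effects v1agra, stock, and offer, shown next to a Bayesian network over the same variables Spam, v1agra, stock, and offer]
Each node has a “conditional probability table” that gives its dependencies on its parents. • [Figure: Bayesian network over Spam, v1agra, stock, and offer]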
Semantics of Bayesian networks • If network is correct, can calculate full joint probability distribution from network: P(x1, ..., xn) = ∏_i P(xi | parents(Xi)), where parents(Xi) denotes specific values of parents of Xi. • Sum of all boxes (entries of the joint distribution) is 1.
Example from textbook • I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar? • Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls • Network topology reflects "causal" knowledge: • A burglar can set the alarm off • An earthquake can set the alarm off • The alarm can cause Mary to call • The alarm can cause John to call
Complexity of Bayesian Networks • For n random Boolean variables: • Full joint probability distribution: 2^n entries • Bayesian network with at most k parents per node: • Each conditional probability table: at most 2^k entries • Entire network: at most n 2^k entries
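To make the gap concrete, the two counts (2^n entries for the full joint distribution versus at most n · 2^k for a network with at most k parents per node) can be compared in Python. The function names and the sample values of n and k are illustrative only.

```python
def full_joint_entries(n):
    """Entries in the full joint distribution over n Boolean variables."""
    return 2 ** n


def network_entries_bound(n, k):
    """Upper bound on CPT entries for n Boolean nodes, at most k parents each."""
    return n * 2 ** k


# Illustrative sizes; k = 2 and n = 5 match the shape of a small alarm-style network.
for n, k in [(5, 2), (30, 5)]:
    print(n, k, full_joint_entries(n), network_entries_bound(n, k))
```

For n = 30 and k = 5 the full joint table has over a billion entries, while the network needs at most 960.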
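Exact inference in Bayesian networks • Query: What is P(Burglary | JohnCalls = true ∧ MaryCalls = true)? • Notation: Capital letters are distributions; lower case letters are values or variables, depending on context. • We have: P(B | j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m), where α is a normalization constant and e, a range over the values of Earthquake and Alarm.
Let’s calculate this for b = “Burglary = true”: P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a) • Worst-case complexity: O(n 2^n), where n is number of Boolean variables. • We can simplify: P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)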
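Inference by enumeration for this query can be sketched in Python as below. The conditional probability tables are not given in these slides; the numbers used here are an assumption, taken from the values commonly quoted for this alarm example in Russell and Norvig's textbook. The helper names (`prob`, `joint`, `burglary_posterior`) are illustrative.

```python
from itertools import product

# ASSUMPTION: CPT values as commonly quoted for the textbook alarm example;
# they do not appear in the slides themselves.
P_B = 0.001                       # P(Burglary = true)
P_E = 0.002                       # P(Earthquake = true)
P_A = {                           # P(Alarm = true | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)


def prob(p_true, value):
    """P(X = value) for a Boolean variable X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """Full joint probability as the product of each node's CPT entry."""
    return (prob(P_B, b) * prob(P_E, e) * prob(P_A[(b, e)], a)
            * prob(P_J[a], j) * prob(P_M[a], m))


def burglary_posterior(j=True, m=True):
    """P(Burglary | j, m): sum out Earthquake and Alarm, then normalize."""
    unnormalized = {
        b: sum(joint(b, e, a, j, m)
               for e, a in product([True, False], repeat=2))
        for b in (True, False)
    }
    alpha = 1.0 / sum(unnormalized.values())
    return {b: alpha * value for b, value in unnormalized.items()}


posterior = burglary_posterior()
print(posterior[True])
```

With these assumed tables the posterior probability of a burglary given both calls comes out at roughly 0.28. The sketch sums over every assignment to the hidden variables, which is exactly the enumeration whose cost grows exponentially with the number of variables.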
A. Onisko et al., A Bayesian network model for diagnosis of liver disorders
Can speed up further via “variable elimination”. However, bottom line on exact inference: In general, it’s intractable. (Exponential in n.) Solution: Approximate inference, by sampling.
Bayesian approaches to knowledge representation and reasoning, Part 3 (Chapter 14, section 5)
What are the advantages of Bayesian networks? • Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables. • Represents “beliefs and knowledge” about a particular class of situations. • Efficient (?) (approximate) inference algorithms • Efficient, effective learning algorithms
Review of exact inference in Bayesian networks General question: What is P(x|e)? Example Question: What is P(c| r,w)?
[Figure, built up over several slides: event space containing the events Cloudy, Rain, Sprinkler, and Wet Grass]
Draw the expression tree for P(c | r, w). • Worst-case complexity is exponential in n (number of nodes) • Problem is having to enumerate all possibilities for many variables.