Naïve Bayes Advanced Statistical Methods in NLP Ling572 January 19, 2012
Roadmap • Naïve Bayes • Multivariate Bernoulli event model (recap) • Multinomial event model • Analysis • HW#3
Naïve Bayes Models in Detail • (McCallum & Nigam, 1998) • Alternate models for Naïve Bayes Text Classification • Multivariate Bernoulli event model • Binary independence model • Features treated as binary – counts ignored • Multinomial event model • Unigram language model
Multivariate Bernoulli Event Text Model • Each document: • Result of |V| independent Bernoulli trials • I.e., for each word in the vocabulary: does the word appear in the document? • From the general Naïve Bayes perspective • Each word corresponds to two outcomes, wt present and wt absent • In each document, exactly one of the two holds • So every document always has |V| elements
Training & Testing • Laplace smoothed training: • MAP decision rule classification: • P(c)
Multinomial Distribution • Trial: select a word according to its probability • Possible outcomes: {w1, w2, …, w|V|} • A document is viewed as the result of one trial per position • P(word = wi) = pi, with Σi pi = 1 • P(X1=x1, X2=x2, …, X|V|=x|V|) = n! / (x1! … x|V|!) · p1^x1 · p2^x2 · … · p|V|^x|V|, where n = Σi xi is the document length
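As a quick illustration of the formula above, here is a minimal sketch of the multinomial PMF in Python; multinomial_pmf is a hypothetical helper name, not part of any course code.

```python
# Multinomial PMF as written above: n!/(x1!...x|V|!) * p1^x1 * ... * p|V|^x|V|.
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """counts[i] = xi occurrences of word wi; probs[i] = pi, with sum(probs) == 1."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)   # remains an exact integer at every step
    return coefficient * prod(p ** x for p, x in zip(probs, counts))
```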
Example • Consider a vocabulary V with only three words: a, b, c • Document di contains only 2 word instances • For each position: P(w=a) = p1, P(w=b) = p2, P(w=c) = p3 • What is the probability that we see ‘a’ once and ‘b’ once in di? Due to F. Xia
Example (cont’d) • How many possible sequences? 3^2 = 9 • Sequences: aa, ab, ac, ba, bb, bc, ca, cb, cc • How many sequences with one ‘a’ and one ‘b’? • n!/(x1! … x|V|!) = 2!/(1!·1!·0!) = 2 • Probability of the sequence ‘ab’: p1·p2 • Probability of the sequence ‘ba’: p2·p1 • So the probability of seeing ‘a’ once and ‘b’ once is 2·p1·p2 Due to F. Xia
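A quick numeric sanity check of the example, using arbitrary illustrative probabilities p1 = 0.5, p2 = 0.3, p3 = 0.2: enumerating all nine two-word sequences gives the same answer as 2·p1·p2.

```python
# Enumerate all 3^2 sequences and sum those containing exactly one 'a' and one 'b'.
from itertools import product

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # arbitrary probabilities for illustration
brute_force = sum(p[w1] * p[w2]
                  for w1, w2 in product("abc", repeat=2)
                  if sorted((w1, w2)) == ["a", "b"])
print(brute_force, 2 * p["a"] * p["b"])   # both print 0.3, i.e. 2*p1*p2
```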
Multinomial Event Model • Document is a sequence of word events drawn from vocabulary V • Assume document length is independent of class • Assume (Naïve Bayes) words are independent of context • Define Nit = # of occurrences of wt in document di • Then under the multinomial event model: P(di|cj) = P(|di|) · |di|! · Πt [ P(wt|cj)^Nit / Nit! ]
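Here is a minimal sketch of the document likelihood above in log space, assuming word_probs maps each word to P(wt|c) for one class (e.g., after smoothing) and treating the length prior P(|di|) as a class-independent constant that is dropped.

```python
# log P(di|c) under the multinomial event model, dropping the length prior P(|di|):
# log |di|! + sum_t [ Nit * log P(wt|c) - log Nit! ].
import math
from collections import Counter

def log_multinomial_likelihood(doc_tokens, word_probs):
    counts = Counter(doc_tokens)      # Nit for this document
    n = sum(counts.values())          # |di|
    logp = math.lgamma(n + 1)         # log |di|!
    for w, n_it in counts.items():    # assumes every token has an entry in word_probs
        logp += n_it * math.log(word_probs[w]) - math.lgamma(n_it + 1)
    return logp
```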
Training • P(cj|di) = 1 if document di is of class cj, and 0 otherwise • So, with Laplace smoothing: P(wt|cj) = (1 + Σi Nit P(cj|di)) / (|V| + Σs Σi Nis P(cj|di)) • Contrast this with the multivariate Bernoulli: counts Nit replace the binary indicators Bit, and the denominator sums word tokens in the class rather than documents
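A minimal training sketch following the smoothed estimate above, assuming hard labels (so P(cj|di) is either 0 or 1), with docs a list of token lists and labels a parallel list; the names are illustrative.

```python
# Add-one smoothed multinomial estimates: P(wt|c) = (1 + count of wt in class c) /
# (|V| + total tokens in class c), plus document-frequency class priors.
from collections import Counter, defaultdict

def train_multinomial(docs, labels, vocab):
    token_counts = defaultdict(Counter)   # per-class token counts, summing Nit over docs
    class_doc_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        token_counts[c].update(w for w in doc if w in vocab)
    priors = {c: n / len(docs) for c, n in class_doc_counts.items()}
    cond = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values())
        cond[c] = {w: (1 + token_counts[c][w]) / (len(vocab) + total) for w in vocab}
    return priors, cond
```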
Testing • To classify a document di, compute: argmaxc P(c) P(di|c) • Dropping the factors that do not depend on c, this is: argmaxc P(c) Πt P(wt|c)^Nit
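And the corresponding classification step in log space, assuming priors and cond dictionaries of the shape produced by the training sketch above; out-of-vocabulary words are simply skipped here, which is one of several reasonable choices.

```python
# argmax_c [ log P(c) + sum_t Nit * log P(wt|c) ]; the |di|! and Nit! factors are
# the same for every class, so they drop out of the argmax.
import math
from collections import Counter

def classify_multinomial(doc_tokens, priors, cond):
    counts = Counter(doc_tokens)
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w, n_it in counts.items():
            if w in cond[c]:              # skip out-of-vocabulary words
                score += n_it * math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```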
Two Naïve Bayes Models • Multivariate Bernoulli event model: • Models binary presence/absence of each word feature • Multinomial event model: • Models counts of word features; a class-conditional unigram language model • In experiments on a range of text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli (McCallum & Nigam, 1998)
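If you want to reproduce this kind of comparison on your own data, one option is scikit-learn, which ships both event models; this is a rough sketch on a tiny made-up corpus, not the experimental setup used by McCallum & Nigam.

```python
# Compare the two event models with scikit-learn: binary features for BernoulliNB,
# raw counts for MultinomialNB. The toy corpus below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train_texts = ["the film was great", "great acting and a great plot",
               "boring slow film", "a terrible waste of time"]
train_y = ["pos", "pos", "neg", "neg"]
test_texts = ["great plot", "boring and slow"]
test_y = ["pos", "neg"]

binary_vec = CountVectorizer(binary=True)   # presence/absence features (Bit)
count_vec = CountVectorizer()               # word-count features (Nit)

bern = BernoulliNB().fit(binary_vec.fit_transform(train_texts), train_y)
multi = MultinomialNB().fit(count_vec.fit_transform(train_texts), train_y)

print("Bernoulli accuracy:  ", bern.score(binary_vec.transform(test_texts), test_y))
print("Multinomial accuracy:", multi.score(count_vec.transform(test_texts), test_y))
```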
Thinking about Performance • Naïve Bayes: conditional independence assumption • Clearly unrealistic, but performance is often good • Why? • Classification depends only on which class scores highest (the argmax), not on the magnitude of the estimated probabilities • The direction of the decision is usually right even when the probability estimates are poor • Multivariate Bernoulli vs. multinomial • Why does the multinomial usually perform better? • It captures additional information: presence/absence plus frequency • What if we wanted to include other types of features? • Multivariate Bernoulli: a new binary feature is just another Bernoulli trial • Multinomial: can’t easily mix in features drawn from other distributions