Naïve Bayes Advanced Statistical Methods in NLP Ling572 January 19, 2012
Roadmap • Naïve Bayes • Multivariate Bernoulli event model (recap) • Multinomial event model • Analysis • HW#3
Naïve Bayes Models in Detail • (McCallum & Nigam, 1998) • Alternate models for Naïve Bayes Text Classification • Multivariate Bernoulli event model • Binary independence model • Features treated as binary – counts ignored • Multinomial event model • Unigram language model
Multivariate Bernoulli Event Text Model • Each document: • Result of |V| independent Bernoulli trials • I.e., for each word in the vocabulary: does the word appear in the document? • From the general Naïve Bayes perspective • Each word corresponds to two outcomes, wt present and wt absent • In each document, exactly one of the two holds • So every document always has |V| elements
Training & Testing • Laplace smoothed training: • MAP decision rule classification: • P(c)
Multinomial Distribution • Trial: select a word according to its probability • Possible outcomes: {w1, w2, …, w|V|} • A document is viewed as the result of one trial per position • P(word = wi) = pi, with Σi pi = 1 • P(X1=x1, X2=x2, …, X|V|=x|V|) = n! / (x1! … x|V|!) · p1^x1 · p2^x2 · … · p|V|^x|V|, where n = Σi xi is the document length
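As a quick illustration of the formula above, here is a minimal sketch of the multinomial PMF in Python; multinomial_pmf is a hypothetical helper name, not part of any course code.

```python
# Multinomial PMF as written above: n!/(x1!...x|V|!) * p1^x1 * ... * p|V|^x|V|.
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """counts[i] = xi occurrences of word wi; probs[i] = pi, with sum(probs) == 1."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)   # remains an exact integer at every step
    return coefficient * prod(p ** x for p, x in zip(probs, counts))
```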
Example • Consider a vocabulary V with only three words: a, b, c • Document di contains only 2 word instances • For each position: P(w=a) = p1, P(w=b) = p2, P(w=c) = p3 • What is the probability that we see ‘a’ once and ‘b’ once in di? Due to F. Xia
Example (cont’d) • How many possible sequences? 3^2 = 9 • Sequences: aa, ab, ac, ba, bb, bc, ca, cb, cc • How many sequences with one ‘a’ and one ‘b’? • n!/(x1! … x|V|!) = 2!/(1!·1!·0!) = 2 • Probability of the sequence ‘ab’: p1·p2 • Probability of the sequence ‘ba’: p2·p1 • So the probability of seeing ‘a’ once and ‘b’ once is 2·p1·p2 Due to F. Xia
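A quick numeric sanity check of the example, using arbitrary illustrative probabilities p1 = 0.5, p2 = 0.3, p3 = 0.2: enumerating all nine two-word sequences gives the same answer as 2·p1·p2.

```python
# Enumerate all 3^2 sequences and sum those containing exactly one 'a' and one 'b'.
from itertools import product

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # arbitrary probabilities for illustration
brute_force = sum(p[w1] * p[w2]
                  for w1, w2 in product("abc", repeat=2)
                  if sorted((w1, w2)) == ["a", "b"])
print(brute_force, 2 * p["a"] * p["b"])   # both print 0.3, i.e. 2*p1*p2
```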
Multinomial Event Model • Document is a sequence of word events drawn from vocabulary V • Assume document length is independent of class • Assume (Naïve Bayes) words are independent of context • Define Nit = # of occurrences of wt in document di • Then under the multinomial event model: P(di|cj) = P(|di|) · |di|! · Πt [ P(wt|cj)^Nit / Nit! ]
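Here is a minimal sketch of the document likelihood above in log space, assuming word_probs maps each word to P(wt|c) for one class (e.g., after smoothing) and treating the length prior P(|di|) as a class-independent constant that is dropped.

```python
# log P(di|c) under the multinomial event model, dropping the length prior P(|di|):
# log |di|! + sum_t [ Nit * log P(wt|c) - log Nit! ].
import math
from collections import Counter

def log_multinomial_likelihood(doc_tokens, word_probs):
    counts = Counter(doc_tokens)      # Nit for this document
    n = sum(counts.values())          # |di|
    logp = math.lgamma(n + 1)         # log |di|!
    for w, n_it in counts.items():    # assumes every token has an entry in word_probs
        logp += n_it * math.log(word_probs[w]) - math.lgamma(n_it + 1)
    return logp
```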
Training • P(cj|di) = 1 if document di is of class cj, and 0 otherwise • So, with Laplace smoothing: P(wt|cj) = (1 + Σi Nit P(cj|di)) / (|V| + Σs Σi Nis P(cj|di)) • Contrast this with the multivariate Bernoulli: counts Nit replace the binary indicators Bit, and the denominator sums word tokens in the class rather than documents
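A minimal training sketch following the smoothed estimate above, assuming hard labels (so P(cj|di) is either 0 or 1), with docs a list of token lists and labels a parallel list; the names are illustrative.

```python
# Add-one smoothed multinomial estimates: P(wt|c) = (1 + count of wt in class c) /
# (|V| + total tokens in class c), plus document-frequency class priors.
from collections import Counter, defaultdict

def train_multinomial(docs, labels, vocab):
    token_counts = defaultdict(Counter)   # per-class token counts, summing Nit over docs
    class_doc_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        token_counts[c].update(w for w in doc if w in vocab)
    priors = {c: n / len(docs) for c, n in class_doc_counts.items()}
    cond = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values())
        cond[c] = {w: (1 + token_counts[c][w]) / (len(vocab) + total) for w in vocab}
    return priors, cond
```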
Testing • To classify a document di, compute: argmaxc P(c) P(di|c) • Dropping the factors that do not depend on c, this is: argmaxc P(c) Πt P(wt|c)^Nit
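And the corresponding classification step in log space, assuming priors and cond dictionaries of the shape produced by the training sketch above; out-of-vocabulary words are simply skipped here, which is one of several reasonable choices.

```python
# argmax_c [ log P(c) + sum_t Nit * log P(wt|c) ]; the |di|! and Nit! factors are
# the same for every class, so they drop out of the argmax.
import math
from collections import Counter

def classify_multinomial(doc_tokens, priors, cond):
    counts = Counter(doc_tokens)
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w, n_it in counts.items():
            if w in cond[c]:              # skip out-of-vocabulary words
                score += n_it * math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```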
Two Naïve Bayes Models • Multivariate Bernoulli event model: • Models binary presence/absence of each word feature • Multinomial event model: • Models counts of word features; a class-conditional unigram language model • In experiments on a range of text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli (McCallum & Nigam, 1998)
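If you want to reproduce this kind of comparison on your own data, one option is scikit-learn, which ships both event models; this is a rough sketch on a tiny made-up corpus, not the experimental setup used by McCallum & Nigam.

```python
# Compare the two event models with scikit-learn: binary features for BernoulliNB,
# raw counts for MultinomialNB. The toy corpus below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train_texts = ["the film was great", "great acting and a great plot",
               "boring slow film", "a terrible waste of time"]
train_y = ["pos", "pos", "neg", "neg"]
test_texts = ["great plot", "boring and slow"]
test_y = ["pos", "neg"]

binary_vec = CountVectorizer(binary=True)   # presence/absence features (Bit)
count_vec = CountVectorizer()               # word-count features (Nit)

bern = BernoulliNB().fit(binary_vec.fit_transform(train_texts), train_y)
multi = MultinomialNB().fit(count_vec.fit_transform(train_texts), train_y)

print("Bernoulli accuracy:  ", bern.score(binary_vec.transform(test_texts), test_y))
print("Multinomial accuracy:", multi.score(count_vec.transform(test_texts), test_y))
```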
Thinking about Performance • Naïve Bayes: conditional independence assumption • Clearly unrealistic, but performance is often good • Why? • Classification depends only on which class scores highest (the argmax), not on the magnitude of the estimated probabilities • The direction of the decision is usually right even when the probability estimates are poor • Multivariate Bernoulli vs. multinomial • Why does the multinomial usually perform better? • It captures additional information: presence/absence plus frequency • What if we wanted to include other types of features? • Multivariate Bernoulli: a new binary feature is just another Bernoulli trial • Multinomial: can’t easily mix in features drawn from other distributions