
Naïve Bayes



  1. Naïve Bayes LING 572 Fei Xia Week 2: 1/9/06

  2. Outline • Naïve Bayes in general • Naïve Bayes for TC

  3. Questions • Why is it called Naïve Bayes? • What objective function does it optimize? • How many types of model parameters are there? • What happens at training time? • What happens at test time? • Any variations?

  4. Modeling • Given x = (f1, …, fd), find c* = arg max_c P(c|x) = arg max_c P(c) P(x|c) / P(x) [Bayes rule] = arg max_c P(c) P(x|c) • Independence assumption: P(x|c) = P(f1, f2, …, fd | c) = ∏_k P(fk | c, f1, …, fk-1) ≈ ∏_k P(fk | c) → “Naïve”

  5. Naïve Bayes Model [Graphical model: class node C with feature nodes f1, f2, …, fn as its children] Assumption: each fi is conditionally independent of fj given C.

  6. Model parameters • Choose c* = arg max_c P(c) ∏_k P(fk | c) • Two types of model parameters: • Class prior: P(c) • Conditional prob: P(fk | c) • The number of parameters: |C| + |C|·|V| • How many parameters are free?

  7. Training: estimating parameters θ • Maximum likelihood (ML): θ* = arg max_θ P(trainingData | θ) • P(fk | ci) = Cnt(fk, ci) / Cnt(ci) • P(ci) = Cnt(ci) / Σ_i Cnt(ci)
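
A minimal sketch of these count-based ML estimates, assuming a toy dataset where each instance is a list of feature values plus a class label (all names and data below are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

# Toy labeled data: (feature list, class label); purely illustrative.
training_data = [
    (["f1=a", "f2=x"], "c1"),
    (["f1=a", "f2=y"], "c1"),
    (["f1=b", "f2=x"], "c2"),
]

class_counts = Counter()                # Cnt(ci)
feature_counts = defaultdict(Counter)   # Cnt(fk, ci)

for features, label in training_data:
    class_counts[label] += 1
    for f in features:
        feature_counts[label][f] += 1

# Unsmoothed ML estimates: relative frequencies.
total = sum(class_counts.values())
prior = {c: class_counts[c] / total for c in class_counts}      # P(ci)
cond = {c: {f: feature_counts[c][f] / class_counts[c]           # P(fk | ci)
            for f in feature_counts[c]}
        for c in class_counts}

print(prior)   # {'c1': 0.666..., 'c2': 0.333...}
```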

  8. Laplace Estimate/Correction/Smoothing • Pretend you saw each outcome one more time than you actually did. • Suppose X has K possible outcomes, and the counts for them are n1, …, nK, which sum to N. • Without smoothing: P(X=i) = ni / N • With Laplace smoothing: P(X=i) = (ni + 1) / (N + K) • It can be derived from a Dirichlet prior as a MAP estimate.
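
A small sketch of the Laplace estimate for one discrete variable (the function name and example counts are just illustrative):

```python
def laplace_estimate(counts, k):
    """Laplace-smoothed estimates: P(X=i) = (n_i + 1) / (N + K).

    counts: observed counts n_1..n_K (length K = k), summing to N.
    """
    n = sum(counts)
    return [(c + 1) / (n + k) for c in counts]

# Example: 3 outcomes observed 5, 0, 1 times.
print(laplace_estimate([5, 0, 1], k=3))   # [0.666..., 0.111..., 0.222...]
```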

  9. Classifying • MAP (maximum a posteriori) decision rule: classify(x) = classify(f1, …, fd) = arg max_c P(c|x) = arg max_c P(c) ∏_k P(fk | c)
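
A hedged sketch of this decision rule, with made-up parameter tables and log probabilities to avoid underflow; the tiny back-off value for unseen features is an assumption for illustration, not part of the slides:

```python
import math

# Illustrative parameter tables (made-up numbers).
prior = {"c1": 0.6, "c2": 0.4}                           # P(c)
cond = {"c1": {"f1=a": 0.8, "f2=x": 0.5},                # P(fk | c)
        "c2": {"f1=a": 0.2, "f2=x": 0.7}}

def classify(features, prior, cond, unseen=1e-10):
    """arg max_c  log P(c) + sum_k log P(fk | c), computed in log space."""
    best_c, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for f in features:
            score += math.log(cond[c].get(f, unseen))    # back off for unseen features
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify(["f1=a", "f2=x"], prior, cond))           # -> "c1"
```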

  10. Naïve Bayes for TC

  11. Features • Features: bag of words (word order information is lost) • Number of feature templates: 1 • Number of features: |V| • Features: wt, t ∈ {1, 2, …, |V|}
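
For concreteness, a small illustration (not from the slides) of turning a document into bag-of-words features, in count and binary form:

```python
from collections import Counter

doc = "the cat sat on the mat".split()

count_features = Counter(doc)                # word -> frequency (order is lost)
binary_features = {w: 1 for w in set(doc)}   # word -> present/absent

print(count_features)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(binary_features)
```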

  12. Issues • Is wt a binary feature? • Are absent features used for calculating P(dj | ci)?

  13. Two Naïve Bayes Models (McCallum and Nigam, 1998) • Multi-variate Bernoulli event model (a.k.a. binary independence model) • All features are binary: the number of times a feature occurs in an instance is ignored. • When calculating P(d | c), all features are used, including the absent features. • Multinomial event model: “unigram LM”

  14. Bernoulli distribution • Bernoulli distribution: has exactly two mutually exclusive outcomes: P(X=1)=p and P(X=0)=1-p. • Bernoulli trial: a single experiment which can have one of two possible outcomes • A Bernoulli process is a sequence of iid (independent identically distributed) Bernoulli trials.

  15. Multi-variate Bernoulli Model • A document is seen as a collection of |V| independent Bernoulli experiments, one for each word in the vocabulary: does this word appear in the document? • Let Bit = 1 if wt appears in di, and 0 otherwise • Modeling: P(di | cj) = ∏_t ( Bit P(wt | cj) + (1 - Bit)(1 - P(wt | cj)) ) • Training: P(ci) = DocNum(ci) / DocNum
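
A minimal sketch of this likelihood, assuming a table cond_b[c][w] = P(w | c); note that absent vocabulary words still contribute a factor 1 - P(w | c). All names and numbers are illustrative:

```python
import math

def bernoulli_log_likelihood(doc_words, c, vocab, cond_b):
    """log P(di | cj): every vocabulary word contributes a factor,
    P(wt | cj) if present in the document, 1 - P(wt | cj) if absent."""
    present = set(doc_words)
    logp = 0.0
    for w in vocab:
        p = cond_b[c][w]
        logp += math.log(p if w in present else 1.0 - p)
    return logp

# Tiny illustrative example (made-up probabilities).
vocab = ["good", "bad", "movie"]
cond_b = {"pos": {"good": 0.7, "bad": 0.1, "movie": 0.5}}
print(bernoulli_log_likelihood(["good", "movie"], "pos", vocab, cond_b))
```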

  16. Training (cont) P(wt | cj) = (1 + DocNum(wt, cj)) / (2 + DocNum(cj)), where P(cj | di) = 1 if di has the label cj, and 0 otherwise.
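
A sketch of this training step (the +1 and +2 are the Laplace correction for a binary variable); the data layout is an assumption for illustration:

```python
from collections import Counter

def train_bernoulli(docs):
    """docs: list of (set_of_words, class_label) pairs (illustrative layout)."""
    doc_num = Counter()                                   # DocNum(cj)
    word_doc_num = {}                                     # DocNum(wt, cj)
    for words, c in docs:
        doc_num[c] += 1
        word_doc_num.setdefault(c, Counter()).update(set(words))

    total = sum(doc_num.values())
    prior = {c: doc_num[c] / total for c in doc_num}      # P(cj)
    cond_b = {c: {w: (1 + word_doc_num[c][w]) / (2 + doc_num[c])   # P(wt | cj)
                  for w in word_doc_num[c]}
              for c in doc_num}
    return prior, cond_b

prior, cond_b = train_bernoulli([({"good", "movie"}, "pos"),
                                 ({"bad", "movie"}, "neg")])
```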

  17. Questions about Bernoulli event model?

  18. Multinomial distribution • Possible outcomes = {w1, w2, …, w|V|} • A trial for each word position: P(CurWord = wi) = pi and Σ_i pi = 1 • Perform n such trials, where n is the length of the document • Let Xi be the number of times that the word wi is observed in the document. • P(X1=x1, …, X|V|=x|V|) = n! / (x1! … x|V|!) · p1^x1 … p|V|^x|V| = n! ∏_k (pk^xk / xk!)

  19. An example • Suppose • the vocabulary, V, contains only three words: a, b, and c. • a document, di, contains only 2 word tokens • For each position, P(w=a) = p1, P(w=b) = p2, and P(w=c) = p3. • What is the probability that we see “a” once and “b” once in di?

  20. An example (cont) • 9 possible sequences: aa, ab, ac, ba, bb, bc, ca, cb, cc. • The number of sequences with one “a” and one “b” (ab and ba): n! / (x1! … x|V|!) = 2 • The prob of the sequence “ab” is p1·p2, and so is the prob of the sequence “ba”. • So the prob of seeing “a” once and “b” once is: n! ∏_k (pk^xk / xk!) = 2 p1 p2
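
A quick numeric check of this example, with illustrative values p1 = 0.5, p2 = 0.3, p3 = 0.2:

```python
from math import factorial

p = {"a": 0.5, "b": 0.3, "c": 0.2}
x = {"a": 1, "b": 1, "c": 0}           # counts: one "a", one "b", no "c"
n = sum(x.values())                    # document length = 2

coef = factorial(n)
for w in x:
    coef //= factorial(x[w])           # n! / (x1! ... x|V|!) = 2

prob = coef * 1.0
for w in x:
    prob *= p[w] ** x[w]

print(prob)                            # 2 * 0.5 * 0.3 = 0.3
```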

  21. Multinomial Model • A document is seen as an ordered sequence of word events, drawn from the vocabulary V. • Nit: the number of times that wt appears in di • Modeling: multinomial distribution: P(di | cj) = |di|! ∏_t [ P(wt | cj)^Nit / Nit! ]

  22. Training for multinomial model • P(cj) = DocNum(cj) / DocNum • With Laplace smoothing: P(wt | cj) = (1 + Cnt(wt, cj)) / (|V| + Σ_s Cnt(ws, cj)), where Cnt(wt, cj) is the number of tokens of wt in the documents labeled cj
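
A sketch of standard Laplace-smoothed token-count estimates for the multinomial model; the data layout and names are illustrative assumptions:

```python
from collections import Counter

def train_multinomial(docs, vocab):
    """docs: list of (list_of_tokens, class_label) pairs (illustrative layout)."""
    doc_num = Counter()                                   # documents per class
    tok = {}                                              # token counts per class
    for tokens, c in docs:
        doc_num[c] += 1
        tok.setdefault(c, Counter()).update(tokens)

    total_docs = sum(doc_num.values())
    prior = {c: doc_num[c] / total_docs for c in doc_num}             # P(cj)
    cond_m = {c: {w: (1 + tok[c][w]) /                                # P(wt | cj)
                     (len(vocab) + sum(tok[c].values()))
                  for w in vocab}
              for c in doc_num}
    return prior, cond_m

vocab = ["good", "bad", "movie"]
prior, cond_m = train_multinomial([(["good", "good", "movie"], "pos"),
                                   (["bad", "movie"], "neg")], vocab)
```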

  23. Two models • Bernoulli event model: treat features as binary; each trial corresponds to a feature. • Multinomial event model: treat features as non-binary; each trial corresponds to a word position in the document. • The multinomial event model usually beats the Bernoulli event model (McCallum and Nigam, 1998).

  24. Summary of Naïve Bayes • It makes a strong independence assumption. • It generally works well despite the strong assumption. Why? • Both training and testing are simple and fast.

  25. Summary of Naïve Bayes (cont) • Strengths: • Simplicity (conceptual) • Efficiency at training • Efficiency at testing time • Handling multi-class • Scalability • Output topN • Weaknesses: • Theoretical validity • Prediction accuracy: ?? • Stability and robustness

  26. Today • Classification algorithm overview • Naïve Bayes in general • Naïve Bayes for text classification

  27. Coming up • kNN and Rocchio on Thurs: read the paper • An additional lab session right after Thursday’s class. • Hw1 is due at 11pm on Sat: no extension
