Naïve Bayes LING 572 Fei Xia Week 2: 1/9/06
Outline • Naïve Bayes in general • Naïve Bayes for TC
Questions • Why is it called Naïve Bayes? • What objective function does it optimize? • How many types of model parameters? • What happens at training time? • What happens at test time? • Any variations?
Modeling • Given x = (f1, …, fd), find c* = argmax_c P(c|x) = argmax_c P(c) P(x|c) / P(x) (Bayes rule) = argmax_c P(c) P(x|c) • Independence assumption ("naïve"): P(x|c) = P(f1, f2, …, fd | c) = ∏_k P(fk | c, f1, …, fk-1) ≈ ∏_k P(fk | c)
Naïve Bayes Model [Graphical model: class node C with an arrow to each feature node f1, f2, …, fn] Assumption: each fi is conditionally independent of fj given C.
Model parameters • Choose c* = argmax_c P(c) ∏_k P(fk | c) • Two types of model parameters: • Class prior: P(c) • Conditional prob: P(fk | c) • The number of parameters: |C| + |C|·|V| • How many parameters are free?
Training: estimating parameters • Maximum likelihood (ML): θ* = argmax_θ P(trainingData | θ) • P(fk | ci) = Cnt(fk, ci) / Cnt(ci) • P(ci) = Cnt(ci) / Σ_i Cnt(ci)
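These count-based ML estimates map directly onto code. Below is a minimal sketch (not from the slides), assuming training data comes as (features, label) pairs; the function name ml_estimates and the data layout are illustrative.

```python
from collections import Counter, defaultdict

def ml_estimates(training_data):
    """ML estimates by counting.
    training_data: iterable of (features, label) pairs, where
    features is the collection of features present in the instance."""
    cnt_c = Counter()                # Cnt(ci): instances per class
    cnt_fc = defaultdict(Counter)    # Cnt(fk, ci): class-ci instances containing fk
    for features, label in training_data:
        cnt_c[label] += 1
        for f in set(features):
            cnt_fc[label][f] += 1
    n = sum(cnt_c.values())          # Σ_i Cnt(ci)
    prior = {c: cnt_c[c] / n for c in cnt_c}                     # P(ci)
    cond = {c: {f: cnt_fc[c][f] / cnt_c[c] for f in cnt_fc[c]}   # P(fk | ci)
            for c in cnt_c}
    return prior, cond
```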
Laplace Estimate/Correction/Smoothing • Pretend you saw each outcome one more time than you actually did. • Suppose X has K possible outcomes, and the counts for them are n1, …, nK, which sum to N. • Without smoothing: P(X=i) = ni / N • With Laplace smoothing: P(X=i) = (ni + 1) / (N + K) • It can be derived as a MAP estimate under a Dirichlet prior.
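A small sketch of the smoothed estimate with made-up counts; note that it gives non-zero probability to an outcome ('c' here) that was never observed.

```python
def laplace_estimate(counts, outcomes):
    """P(X = i) = (n_i + 1) / (N + K) over the K possible outcomes,
    including outcomes with zero observed count."""
    n = sum(counts.get(x, 0) for x in outcomes)   # N
    k = len(outcomes)                             # K
    return {x: (counts.get(x, 0) + 1) / (n + k) for x in outcomes}

# Unsmoothed: P(c) = 0/4; smoothed: P(a) = 4/7, P(b) = 2/7, P(c) = 1/7
print(laplace_estimate({'a': 3, 'b': 1}, ['a', 'b', 'c']))
```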
Classifying • MAP (maximum a posteriori) decision rule: classify(x) = classify(f1, …, fd) = argmax_c P(c|x) = argmax_c P(c) ∏_k P(fk | c)
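A minimal sketch of the decision rule, assuming the prior and cond tables from the training sketch above. It sums log probabilities (a standard trick to avoid floating-point underflow) instead of multiplying raw probabilities; the eps back-off for unseen features is an illustrative shortcut in place of proper smoothing.

```python
import math

def classify(features, prior, cond, eps=1e-10):
    """argmax_c [ log P(c) + sum_k log P(fk | c) ]."""
    best_c, best_score = None, float('-inf')
    for c in prior:
        score = math.log(prior[c])
        for f in features:
            # eps backs off features unseen with class c
            # (Laplace smoothing at training time would normally handle this)
            score += math.log(cond[c].get(f, eps))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```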
Features • Features: bag of words (word order information is lost) • Number of feature templates: 1 • Number of features: |V| • Features: wt, t ∈ {1, 2, …, |V|}
Issues • Is wt a binary feature? • Are absent features used for calculating P(dj | ci) ?
Two Naïve Bayes Models (McCallum and Nigam, 1998) • Multi-variate Bernoulli event model (a.k.a. binary independence model) • All features are binary: the number of times a feature occurs in an instance is ignored. • When calculating P(d | c), all features are used, including the absent features. • Multinomial event model: "unigram LM"
Bernoulli distribution • Bernoulli distribution: has exactly two mutually exclusive outcomes: P(X=1)=p and P(X=0)=1-p. • Bernoulli trial: a single experiment which can have one of two possible outcomes • A Bernoulli process is a sequence of iid (independent identically distributed) Bernoulli trials.
Multi-variate Bernoulli Model • A document is seen as a collection of |V| independent Bernoulli experiments, one for each word in the vocabulary: does this word appear in the document? • Let Bit = 1 if wt appears in di, = 0 otherwise • Modeling: P(di | cj) = ∏_t (Bit P(wt | cj) + (1 - Bit)(1 - P(wt | cj))) • Training: P(cj) = DocNum(cj) / DocNum
Training (cont) • P(wt | cj) = (1 + DocNum(wt, cj)) / (2 + DocNum(cj)) • Here DocNum(wt, cj) is the number of class-cj documents containing wt; the counts can also be written with the indicator P(cj | di) = 1 if di has the label cj, = 0 otherwise.
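A sketch of both the smoothed training estimate and the document likelihood from the previous slide, assuming documents are given as collections of word tokens; function and variable names are illustrative.

```python
import math

def bernoulli_train(docs, labels, vocab):
    """P(cj) = DocNum(cj) / DocNum and
    P(wt | cj) = (1 + DocNum(wt, cj)) / (2 + DocNum(cj))."""
    classes = set(labels)
    doc_num = {c: sum(1 for y in labels if y == c) for c in classes}
    p_w = {c: {w: (1 + sum(1 for d, y in zip(docs, labels) if y == c and w in d))
                  / (2 + doc_num[c])
               for w in vocab}
           for c in classes}
    prior = {c: doc_num[c] / len(docs) for c in classes}
    return prior, p_w

def bernoulli_log_prob(doc, c, p_w):
    """log P(di | cj): every vocabulary word contributes,
    absent words through the (1 - P(wt | cj)) term."""
    present = set(doc)
    return sum(math.log(p_w[c][w]) if w in present else math.log(1 - p_w[c][w])
               for w in p_w[c])
```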
Multinomial distribution • Possible outcomes = {w1, w2, …, w|V|} • A trial for each word position: P(CurWord = wi) = pi and Σ_i pi = 1 • Perform n independent trials, where n is the length of the document • Let Xi be the number of times that the word wi is observed in the document. • P(X1 = x1, …, X|V| = x|V|) = n! / (x1! … x|V|!) * p1^x1 … p|V|^x|V| = n! ∏_k (pk^xk / xk!)
An example • Suppose • the vocabulary V contains only three words: a, b, and c. • a document, di, contains only 2 word tokens. • For each position, P(w=a) = p1, P(w=b) = p2 and P(w=c) = p3. • What is the probability that we see "a" once and "b" once in di?
An example (cont) • 9 possible sequences: aa, ab, ac, ba, bb, bc, ca, cb, cc. • The number of sequences with one "a" and one "b" (ab and ba): n! / (x1! … x|V|!) = 2 • The prob of the sequence "ab" is p1*p2, and so is the prob of the sequence "ba". • So the prob of seeing "a" once and "b" once is: n! ∏_k (pk^xk / xk!) = 2 p1*p2
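A quick sketch checking this arithmetic, with made-up values for p1, p2, p3; it compares the closed-form multinomial probability to a brute-force sum over all length-2 sequences.

```python
from math import factorial
from itertools import product

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}       # illustrative p1, p2, p3

# Closed form: n! * prod_k(pk^xk / xk!) with n = 2 and x = (1, 1, 0)
n, x = 2, {'a': 1, 'b': 1, 'c': 0}
closed = factorial(n)
for w in p:
    closed *= p[w] ** x[w] / factorial(x[w])

# Brute force: add up every length-2 sequence containing one 'a' and one 'b'
brute = sum(p[s[0]] * p[s[1]] for s in product('abc', repeat=2)
            if sorted(s) == ['a', 'b'])

print(closed, brute)   # both 0.3, i.e. 2 * p1 * p2
```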
Multinomial Model • A document is seen as an ordered sequence of word events, drawn from the vocabulary V. • Nit: the number of times that wt appears in di • Modeling: multinomial distribution with n = |di|, pt = P(wt | cj), xt = Nit: P(di | cj) = |di|! ∏_t P(wt | cj)^Nit / Nit!
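A sketch of the per-document score under this model, assuming smoothed word probabilities p_w[c][w] (e.g. from Laplace-smoothed counts); it drops the factorial terms, which do not depend on the class and so do not affect the argmax.

```python
import math
from collections import Counter

def multinomial_log_prob(doc, c, p_w):
    """Class-dependent part of log P(di | cj): sum_t Nit * log P(wt | cj)."""
    counts = Counter(doc)            # Nit for each word type in di
    return sum(n * math.log(p_w[c][w]) for w, n in counts.items())
```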
Two models • Bernoulli event model: treat features as binary; each trial corresponds to a feature. • Multinomial event model: treat features as non-binary; each trial corresponds to a word position in the document. • The multinomial event model usually beats the Bernoulli event model (McCallum and Nigam, 1998).
Summary of Naïve Bayes • It makes a strong independence assumption. • It generally works well despite the strong assumption. Why? • Both training and testing are simple and fast.
Summary of Naïve Bayes (cont) • Strengths: • Conceptual simplicity • Efficiency at training time • Efficiency at test time • Handling multi-class problems • Scalability • Can output the top-N classes • Weaknesses: • Theoretical validity • Prediction accuracy: ?? • Stability and robustness
Today • Classification algorithm overview • Naïve Bayes in general • Naïve Bayes for text classification
Coming up • kNN and Rocchio on Thurs: read the paper • An additional lab session right after Thursday’s class. • Hw1 is due at 11pm on Sat: no extension