This illustrative example covers how Naive Bayes is used for document classification, from the binomial and multinomial distributions to smoothing techniques. Learn about probability estimation, topic classes, and the importance of feature selection in classification tasks.
Naive Bayes for Document Classification: Illustrative Example
Document Classification • Given a document, find its class (e.g. headlines, sports, economics, fashion, …) • We assume the document is a "bag of words": d ~ { t1, t2, t3, …, tnd }, where nd is the number of tokens in d • Using Naive Bayes with the multinomial model: P(c|d) ∝ P(c) · Πk=1..nd P(tk|c), and we assign d to the class with the highest P(c|d) (a code sketch follows below).
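A minimal sketch of this scoring rule in Python. The function and variable names (nb_score, class_prior, cond_prob) are illustrative, not from the slides, and the scores are computed in log space, anticipating the underflow point on the Summary slide:

import math

def nb_score(doc_tokens, class_prior, cond_prob):
    # Multinomial Naive Bayes score for one class: log P(c) + sum over tokens of log P(t|c).
    log_score = math.log(class_prior)
    for t in doc_tokens:
        if t in cond_prob:              # skip out-of-vocabulary terms
            log_score += math.log(cond_prob[t])
    return log_score

def nb_classify(doc_tokens, priors, cond_probs):
    # priors: {class: P(c)}, cond_probs: {class: {term: P(t|c)}}
    return max(priors, key=lambda c: nb_score(doc_tokens, priors[c], cond_probs[c]))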
Binomial Distribution • n independent Bernoulli trials, each of which results in success with probability p • The binomial distribution gives the probability of any particular split of the n trials between the two categories (successes and failures). • e.g. You flip a coin 10 times with P(Heads) = 0.6. What is the probability of getting 8 heads and 2 tails? • P(k) = C(n, k) · p^k · (1 − p)^(n − k) • with k being the number of successes (or, to see the similarity with the multinomial, consider the first class being selected k times and the second n − k times).
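Plugging the coin example into this formula (my arithmetic, not from the slide): P(8 heads) = C(10, 8) · 0.6^8 · 0.4^2 = 45 · 0.01679616 · 0.16 ≈ 0.121.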
Multinomial Distribution • Generalization of the binomial distribution • n independent trials, each of which results in one of k possible outcomes • The multinomial distribution gives the probability of any particular combination of counts across the k categories. • e.g. You have balls of three colours in a bin (3 balls of each colour, so pR = pG = pB = 1/3), from which you draw n = 9 balls with replacement. What is the probability of getting 8 red, 1 green, 0 blue? • P(x1, x2, x3) = n! / (x1! · x2! · x3!) · pR^x1 · pG^x2 · pB^x3
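Plugging the ball example into this formula (my arithmetic, not from the slide): P(8, 1, 0) = 9! / (8! · 1! · 0!) · (1/3)^8 · (1/3)^1 · (1/3)^0 = 9 · (1/3)^9 = 9 / 19683 ≈ 0.00046.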
Naive Bayes w/ Multinomial Model (from McCallum and Nigam, 1998) Advanced
Naive Bayes w/ Multivariate Binomial (from McCallum and Nigam, 1998) Advanced
Smoothing • For each term t, we need to estimate P(t|c). • The maximum-likelihood estimate is P(t|c) = Tct / Σt' Tct', where Tct is the count of term t in all documents of class c.
Smoothing • Because this estimate will be 0 if a term does not appear with a class in the training data, we need smoothing. • Laplace smoothing: P(t|c) = (Tct + 1) / (Σt' (Tct' + 1)) = (Tct + 1) / (Σt' Tct' + |V|), where |V| is the number of terms in the vocabulary (see the code sketch below).
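A minimal sketch of the Laplace-smoothed estimate, assuming the per-class term counts have already been collected; the function and variable names are illustrative, not from the slides:

def laplace_cond_prob(term_counts, vocabulary):
    # term_counts: dict mapping term -> Tct, the count of the term in all documents of class c
    # vocabulary:  the set V of all terms seen in the training data (any class)
    total = sum(term_counts.get(t, 0) for t in vocabulary)   # Σt' Tct'
    denom = total + len(vocabulary)                           # add |V| for the +1 in every numerator
    return {t: (term_counts.get(t, 0) + 1) / denom for t in vocabulary}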
Two topic classes: "China", "not China" • V = {Beijing, Chinese, Japan, Macao, Tokyo, Shanghai} • N = 4
Classification and Probability Estimation
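The worked numbers from this slide are not preserved in the text, but the following end-to-end sketch shows how such an example is computed with Laplace smoothing and log-space scoring. The four training documents and the test document below are an assumption for illustration (they match the vocabulary and N = 4 above but are not taken from the slide):

from collections import Counter
import math

# Assumed training set (hypothetical, for illustration): (tokens, class)
training = [
    (["Chinese", "Beijing", "Chinese"], "China"),
    (["Chinese", "Chinese", "Shanghai"], "China"),
    (["Chinese", "Macao"], "China"),
    (["Tokyo", "Japan", "Chinese"], "not China"),
]
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

vocabulary = {t for tokens, _ in training for t in tokens}
classes = {c for _, c in training}

# Priors P(c) and Laplace-smoothed conditional probabilities P(t|c)
priors, cond_probs = {}, {}
for c in classes:
    docs_c = [tokens for tokens, label in training if label == c]
    priors[c] = len(docs_c) / len(training)
    counts = Counter(t for tokens in docs_c for t in tokens)
    denom = sum(counts.values()) + len(vocabulary)
    cond_probs[c] = {t: (counts[t] + 1) / denom for t in vocabulary}

# Score the test document in log space and pick the best class
scores = {c: math.log(priors[c]) + sum(math.log(cond_probs[c][t])
          for t in test_doc if t in vocabulary) for c in classes}
print(max(scores, key=scores.get))   # -> "China" for this assumed data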
Summary: Miscellaneous • Naïve Bayes is linear in the time it takes to scan the data. • When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we compute cmap = argmaxc [ log P(c) + Σk log P(tk|c) ], summing log probabilities instead of multiplying them (see the small illustration below). • For a large training set the vocabulary is large, so it is better to select only a subset of terms; this is done with feature selection. However, accuracy is not badly affected by irrelevant attributes if the data is large.
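To make the underflow point concrete, a tiny illustration with made-up numbers: multiplying many small probabilities collapses to 0.0 in double-precision floating point, while summing their logs stays representable.

import math

probs = [1e-5] * 100                      # 100 term probabilities of 1e-5 (illustrative values)
product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0  (underflows: the true value is 1e-500)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                            # about -1151.3, perfectly representable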