This illustrative example covers how Naive Bayes is used for document classification, from the binomial and multinomial distributions to smoothing techniques. Learn about probability estimation, topic classes, and the importance of feature selection in classification tasks.
Naive Bayes for Document Classification: Illustrative Example
Document Classification • Given a document, find its class (e.g. headlines, sports, economics, fashion, …) • We assume the document is a "bag of words": d ~ { t1, t2, t3, …, tnd }, where nd is the number of tokens in d • Using Naive Bayes with the multinomial model: P(c|d) ∝ P(c) · Πk=1..nd P(tk|c), and we assign d to the class with the highest P(c|d) (a code sketch follows below).
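A minimal sketch of this scoring rule in Python. The function and variable names (nb_score, class_prior, cond_prob) are illustrative, not from the slides, and the scores are computed in log space, anticipating the underflow point on the Summary slide:

import math

def nb_score(doc_tokens, class_prior, cond_prob):
    # Multinomial Naive Bayes score for one class: log P(c) + sum over tokens of log P(t|c).
    log_score = math.log(class_prior)
    for t in doc_tokens:
        if t in cond_prob:              # skip out-of-vocabulary terms
            log_score += math.log(cond_prob[t])
    return log_score

def nb_classify(doc_tokens, priors, cond_probs):
    # priors: {class: P(c)}, cond_probs: {class: {term: P(t|c)}}
    return max(priors, key=lambda c: nb_score(doc_tokens, priors[c], cond_probs[c]))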
Binomial Distribution • n independent Bernoulli trials, each of which results in success with probability p • The binomial distribution gives the probability of any particular split of the n trials between the two categories (successes and failures). • e.g. You flip a coin 10 times with P(Heads) = 0.6. What is the probability of getting 8 heads and 2 tails? • P(k) = C(n, k) · p^k · (1 − p)^(n − k) • with k being the number of successes (or, to see the similarity with the multinomial, consider the first class being selected k times and the second n − k times).
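Plugging the coin example into this formula (my arithmetic, not from the slide): P(8 heads) = C(10, 8) · 0.6^8 · 0.4^2 = 45 · 0.01679616 · 0.16 ≈ 0.121.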
Multinomial Distribution • Generalization of the binomial distribution • n independent trials, each of which results in one of k possible outcomes • The multinomial distribution gives the probability of any particular combination of counts across the k categories. • e.g. You have balls of three colours in a bin (3 balls of each colour, so pR = pG = pB = 1/3), from which you draw n = 9 balls with replacement. What is the probability of getting 8 red, 1 green, 0 blue? • P(x1, x2, x3) = n! / (x1! · x2! · x3!) · pR^x1 · pG^x2 · pB^x3
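Plugging the ball example into this formula (my arithmetic, not from the slide): P(8, 1, 0) = 9! / (8! · 1! · 0!) · (1/3)^8 · (1/3)^1 · (1/3)^0 = 9 · (1/3)^9 = 9 / 19683 ≈ 0.00046.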
Naive Bayes w/ Multinomial Model (from McCallum and Nigam, 1998) Advanced
Naive Bayes w/ Multivariate Binomial (from McCallum and Nigam, 1998) Advanced
Smoothing • For each term t, we need to estimate P(t|c). • The maximum-likelihood estimate is P(t|c) = Tct / Σt' Tct', where Tct is the count of term t in all documents of class c.
Smoothing • Because this estimate will be 0 if a term does not appear with a class in the training data, we need smoothing. • Laplace smoothing: P(t|c) = (Tct + 1) / (Σt' (Tct' + 1)) = (Tct + 1) / (Σt' Tct' + |V|), where |V| is the number of terms in the vocabulary (see the code sketch below).
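A minimal sketch of the Laplace-smoothed estimate, assuming the per-class term counts have already been collected; the function and variable names are illustrative, not from the slides:

def laplace_cond_prob(term_counts, vocabulary):
    # term_counts: dict mapping term -> Tct, the count of the term in all documents of class c
    # vocabulary:  the set V of all terms seen in the training data (any class)
    total = sum(term_counts.get(t, 0) for t in vocabulary)   # Σt' Tct'
    denom = total + len(vocabulary)                           # add |V| for the +1 in every numerator
    return {t: (term_counts.get(t, 0) + 1) / denom for t in vocabulary}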
Two topic classes: "China", "not China" • V = {Beijing, Chinese, Japan, Macao, Tokyo, Shanghai} • N = 4
Classification and Probability Estimation
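The worked numbers from this slide are not preserved in the text, but the following end-to-end sketch shows how such an example is computed with Laplace smoothing and log-space scoring. The four training documents and the test document below are an assumption for illustration (they match the vocabulary and N = 4 above but are not taken from the slide):

from collections import Counter
import math

# Assumed training set (hypothetical, for illustration): (tokens, class)
training = [
    (["Chinese", "Beijing", "Chinese"], "China"),
    (["Chinese", "Chinese", "Shanghai"], "China"),
    (["Chinese", "Macao"], "China"),
    (["Tokyo", "Japan", "Chinese"], "not China"),
]
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

vocabulary = {t for tokens, _ in training for t in tokens}
classes = {c for _, c in training}

# Priors P(c) and Laplace-smoothed conditional probabilities P(t|c)
priors, cond_probs = {}, {}
for c in classes:
    docs_c = [tokens for tokens, label in training if label == c]
    priors[c] = len(docs_c) / len(training)
    counts = Counter(t for tokens in docs_c for t in tokens)
    denom = sum(counts.values()) + len(vocabulary)
    cond_probs[c] = {t: (counts[t] + 1) / denom for t in vocabulary}

# Score the test document in log space and pick the best class
scores = {c: math.log(priors[c]) + sum(math.log(cond_probs[c][t])
          for t in test_doc if t in vocabulary) for c in classes}
print(max(scores, key=scores.get))   # -> "China" for this assumed data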
Summary: Miscellaneous • Naïve Bayes is linear in the time it takes to scan the data. • When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we compute cmap = argmaxc [ log P(c) + Σk log P(tk|c) ], summing log probabilities instead of multiplying them (see the small illustration below). • For a large training set the vocabulary is large, so it is better to select only a subset of terms; this is done with feature selection. However, accuracy is not badly affected by irrelevant attributes if the data is large.
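To make the underflow point concrete, a tiny illustration with made-up numbers: multiplying many small probabilities collapses to 0.0 in double-precision floating point, while summing their logs stays representable.

import math

probs = [1e-5] * 100                      # 100 term probabilities of 1e-5 (illustrative values)
product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0  (underflows: the true value is 1e-500)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                            # about -1151.3, perfectly representable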