Naïve Bayes Classifier Christina Wallin, Period 3 Computer Systems Research Lab 2008-2009
Goal • create a naïve Bayes classifier using the 20 Newsgroups dataset • compare the effectiveness of a multivariate Bayes classifier and a multinomial Bayes classifier, with optimizations
What is Naïve Bayes? • A classification method based on an independence assumption between words • Machine learning: trained on example texts labeled with their classes, it can then classify new texts • Classification is based on the probability that a word will appear in a specific class of text
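A quick formulation of that last point (the standard naïve Bayes decision rule, not taken verbatim from the slides): for a text with words w1, ..., wn, the classifier picks the class c that maximizes P(c) × P(w1|c) × P(w2|c) × ... × P(wn|c). The independence assumption is exactly what lets the joint probability of the words factor into this product of per-word probabilities.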
Previous Research • The algorithm has been around for a while (first used in 1966) • At first it was thought to be less effective because of its simplicity and its false independence assumption, but a recent review of the algorithm's uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu)
Previous Research Cont’d • Currently, the best methods combine naïve Bayes with logistic regression (Shen and Jiang, 2003) • There is still room for improvement: selecting training data and incorporating text length (Lewis, 2001) • My program will investigate what features of training data make it better suited for naïve Bayes, building on the basic structure outlined in many papers
Program Overview • Python with NLTK (Natural Language Toolkit) • file.py (preprocessing) • train.py (training) • test.py (classification)
Procedures: file.py • So far, a program that reads in a text file • Parses the file • Builds a dictionary of all the words present and their frequencies • Words can optionally be stemmed • With PyLab, can graph the 20 most frequent words
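A minimal sketch of what such a preprocessing step might look like, assuming NLTK's PorterStemmer for the optional stemming; the function name parse_file and its arguments are illustrative, not taken from the actual project code:

```python
# Read a text file, optionally stem each word with NLTK, and build a
# word -> frequency dictionary, as described on the file.py slide.
from collections import defaultdict
from nltk.stem import PorterStemmer

def parse_file(path, do_stem=False):
    stemmer = PorterStemmer()
    freqs = defaultdict(int)
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                word = word.strip('.,!?";:()')  # trim surrounding punctuation
                if not word:
                    continue
                if do_stem:
                    word = stemmer.stem(word)
                freqs[word] += 1
    return dict(freqs)
```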
Procedures: train.py • Trains the program on which words occur more frequently in each class • Builds a PFX vector: for each word, the probability that it appears in the class • Multivariate or multinomial
Procedures: Multivariate v. Multinomial • Multivariate: PFX(w) = (number of files containing w + 1) / (number of files in class + vocabulary size) • Multinomial: PFX(w) = (frequency of w + 1) / (number of words in class + vocabulary size)
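A sketch of how those two PFX computations could be implemented, with the add-one smoothing shown in the formulas above. The names class_docs (a list of word-frequency dictionaries, one per training file in the class) and vocab (the set of all words seen in training) are assumptions for illustration:

```python
def train_multivariate(class_docs, vocab):
    # PFX(w) = (files containing w + 1) / (files in class + |vocab|)
    n_docs = len(class_docs)
    pfx = {}
    for w in vocab:
        n_with_w = sum(1 for doc in class_docs if w in doc)
        pfx[w] = (n_with_w + 1) / float(n_docs + len(vocab))
    return pfx

def train_multinomial(class_docs, vocab):
    # PFX(w) = (frequency of w + 1) / (words in class + |vocab|)
    n_words = sum(sum(doc.values()) for doc in class_docs)
    pfx = {}
    for w in vocab:
        freq_w = sum(doc.get(w, 0) for doc in class_docs)
        pfx[w] = (freq_w + 1) / float(n_words + len(vocab))
    return pfx
```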
Procedures: test.py • Using the PFX vectors generated by train.py, goes through the test cases and compares the words in them to those in each class as a whole • Uses a sum of logs to compute the probability, because multiplying many small probabilities would underflow
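A sketch of that log-sum scoring, shown here in the multinomial form (a multivariate scorer would use word presence rather than counts). The names pfx_by_class (class name -> trained PFX dictionary) and priors (class name -> P(class)) are illustrative:

```python
import math

def classify(doc_freqs, pfx_by_class, priors):
    # Score each class by summing log probabilities instead of
    # multiplying raw probabilities, which would underflow to zero.
    best_class, best_score = None, None
    for c, pfx in pfx_by_class.items():
        score = math.log(priors[c])
        for word, count in doc_freqs.items():
            if word in pfx:
                score += count * math.log(pfx[word])
        if best_score is None or score > best_score:
            best_class, best_score = c, score
    return best_class
```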
Testing • Generated text files based on known probabilities of words occurring • Compared the initial, programmed-in probabilities to the PFX values generated • Also used the generated files to test text classification
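One way such a generator might look; the word distribution and file name below are made up for illustration:

```python
import random

def generate_file(path, word_probs, n_words=1000):
    # Draw n_words words according to the given probabilities and write
    # them to a file; training on this file should recover word_probs.
    words = list(word_probs.keys())
    with open(path, 'w') as f:
        for _ in range(n_words):
            r = random.random()
            cum = 0.0
            for w in words:
                cum += word_probs[w]
                if r < cum:
                    f.write(w + ' ')
                    break
            else:
                f.write(words[-1] + ' ')  # guard against rounding error

generate_file('synthetic.txt', {'alpha': 0.5, 'beta': 0.3, 'gamma': 0.2})
```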
Still to come • Improve the multinomial version • Optimization • Take into account the number of words in each file • Analyze the data from 20 Newsgroups to see why certain classes can be classified more easily than others