Naïve Bayes Classifier Christina Wallin, Period 3 Computer Systems Research Lab 2008-2009
Goal • Create a naïve Bayes classifier using the 20 Newsgroups dataset • Compare the effectiveness of a simple naïve Bayes classifier and an optimized one
What is Naïve Bayes? • A classification method based on an independence assumption between words • Machine learning: trained on example texts labeled with their classes, and can then classify new texts • Classification is based on the probability that a word will occur in a specific class of text
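For reference, the standard naïve Bayes decision rule (a textbook formulation, not taken from these slides) is:

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

where the $w_i$ are the words of the text; the "naïve" independence assumption is that each $P(w_i \mid c)$ is treated as independent of the other words given the class $c$.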
Previous Research • The algorithm has been around for a while (first used in 1966) • At first it was thought to be less effective because of its simplicity and false independence assumption, but a recent review of the algorithm's uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu)
Previous Research Cont’d • Currently, the best methods use a combination of naïve Bayes and logistic regression (Shen and Jiang, 2003) • Still room for improvement: data selection for training and how to incorporate text length (Lewis, 2001) • My program will investigate what features of the training data make it better suited for naïve Bayes, building upon the basic structure outlined in many papers
Program Overview • Python with NLTK (Natural Language Toolkit) • file.py • train.py • test.py
Procedures: file.py • So far, a program which reads in a text file • Parses the file • Builds a dictionary of all of the words present and their frequencies • Words can optionally be stemmed or left as-is • With PyLab, can graph the 20 most frequent words (a sketch of this step follows below)
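A minimal sketch of what this parsing step might look like in Python with NLTK; the function names, the regex tokenizer, and the use of matplotlib in place of the PyLab interface are my own assumptions, not the author's actual code.

```python
# Hypothetical sketch of a file.py-style parser (illustrative names only).
import re
from collections import Counter

from nltk.stem import PorterStemmer   # optional stemming
import matplotlib.pyplot as plt       # stand-in for PyLab plotting

def parse_file(path, stem=False):
    """Read a text file and return a word -> frequency dictionary."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    if stem:
        stemmer = PorterStemmer()
        words = [stemmer.stem(w) for w in words]
    return Counter(words)

def plot_top_words(freqs, n=20):
    """Bar chart of the n most frequent words."""
    words, counts = zip(*freqs.most_common(n))
    plt.bar(range(len(words)), counts)
    plt.xticks(range(len(words)), words, rotation=90)
    plt.tight_layout()
    plt.show()
```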
Procedures: train.py • Trains the program on which words occur more frequently in each class • Builds a PFX vector: the probability that each word appears in a text of the class • (number of texts in the class containing the word) / (total number of texts in the class) • Laplace smoothing so that unseen words do not get zero probability (see the sketch below)
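A minimal sketch of the training step as the slide describes it (per-class document counts with add-one smoothing); the data layout and names such as `pfx` and `__unseen__` are assumptions made for illustration.

```python
# Hypothetical sketch of a train.py-style step (illustrative names only).
from collections import defaultdict

def train(class_docs):
    """class_docs: dict mapping class name -> list of documents,
    each document represented as the set of words it contains.
    Returns pfx[c][w] ~ P(a document of class c contains word w),
    with Laplace (add-one) smoothing."""
    pfx = {}
    for c, docs in class_docs.items():
        doc_counts = defaultdict(int)
        for doc in docs:
            for w in doc:
                doc_counts[w] += 1
        n = len(docs)
        # (texts in class containing the word + 1) / (texts in class + 2)
        pfx[c] = {w: (count + 1) / (n + 2) for w, count in doc_counts.items()}
        pfx[c]["__unseen__"] = 1 / (n + 2)  # smoothed value for unseen words
    return pfx
```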
Procedures: test.py • Using the PFX generated by train.py, go through the test cases and compare the words in them to those in each class as a whole • Use a sum of logs to compute the probability, because multiplying many small probabilities would cause numerical underflow (see the sketch below)
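A minimal sketch of the corresponding classification step, summing log probabilities to avoid underflow; the `priors` argument and the `__unseen__` fallback are assumptions carried over from the training sketch above.

```python
# Hypothetical sketch of a test.py-style classifier (illustrative names only).
import math

def classify(doc_words, pfx, priors):
    """doc_words: set of words in the document to classify.
    pfx: per-class word probabilities from train(); priors: class priors.
    Sums log probabilities instead of multiplying raw probabilities."""
    best_class, best_score = None, float("-inf")
    for c, word_probs in pfx.items():
        score = math.log(priors[c])
        for w in doc_words:
            score += math.log(word_probs.get(w, word_probs["__unseen__"]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```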
Testing • Generated text files in which each word occurs with a known probability • Compared the initial, programmed-in probabilities to the PFX generated by train.py • Also used the generated files to test text classification (a sketch of the generation step follows below)
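A minimal sketch, under my own assumptions, of how such test files could be generated: each word is included with a fixed, known probability, so the recovered PFX values can be checked against those probabilities.

```python
# Hypothetical sketch of the synthetic-data check (illustrative names only).
import random

def generate_doc(word_probs):
    """word_probs: dict word -> probability that the word appears in a document."""
    return {w for w, p in word_probs.items() if random.random() < p}

def generate_corpus(word_probs, n_docs=1000):
    """Generate many documents from the same known word probabilities."""
    return [generate_doc(word_probs) for _ in range(n_docs)]
```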
Results: file.py • [Figure] 20 most frequent words in sci.space from 20 Newsgroups • [Figure] 20 most frequent words in rec.sport.baseball from 20 Newsgroups
Results: file.py • The stories are approximately the same length • sci.space is more dense and less to the point • The most frequent word, ‘the’, is the same in both
Results: Effect of stemming • 82.6% correctly classified with the stemmer vs 83.6% without for alt.atheism and rec.autos • 66.6% vs 67.7% for comp.sys.ibm.pc.hardware and comp.sys.mac.hardware • 69.3% vs 70.4% for sci.crypt and alt.atheism • I expected stemming to help, but as shown, using a Porter stemmer on the words before generating the probability vector does not improve accuracy
Still to come • Optimization • Analyze the data from 20 Newsgroups to see why certain classes can be classified more easily than others • Change to a multinomial model, which counts multiple occurrences of words in a file (a sketch follows below)
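A minimal sketch of what the planned multinomial variant might look like, counting every occurrence of each word rather than just its presence; the vocabulary-sized Laplace smoothing shown here is a standard choice, not necessarily the author's.

```python
# Hypothetical sketch of a multinomial-model training step (illustrative names only).
from collections import Counter

def train_multinomial(class_docs, vocab):
    """class_docs: class -> list of documents, each a list of word tokens.
    vocab: set of all words seen in training.
    Returns pfx[c][w] ~ P(next token is w | class c), add-one smoothed."""
    pfx = {}
    for c, docs in class_docs.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        pfx[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return pfx
```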