Naïve Bayes Classification

Christina Wallin
Computer Systems Research Lab, 2008-2009

Goal
• Create a naïve Bayes classifier using the 20 Newsgroups dataset
• Compare the effectiveness of different implementations of this method

What is Naïve Bayes?
- Based on Bayes' Theorem: a classification method that relies on an independence assumption
- "Mars Rover"
- Machine Learning
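For reference, the decision rule that follows from Bayes' theorem together with the independence assumption can be written as below. The symbols are notation introduced here, not taken from the slides: c is a class, d a document, and w_1 ... w_n the words in d.

```latex
P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}
\quad\Rightarrow\quad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

P(d) is the same for every class, so it can be dropped when only the most likely class is needed.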

Program Overview
• Python with NLTK (Natural Language Toolkit)
• file.py
• train.py
• test.py

Procedures: file.py
• Parses a file and builds a dictionary of the words present and their frequencies
• Stems words
• Accounts for file length
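A minimal sketch of what a parser like file.py might do, not the project's actual code. The function name `word_frequencies`, the regex tokenizer, and the reading of "accounting for length" as dividing counts by the file's word count are all assumptions. `toy_stem` is a hypothetical stand-in; a real stemmer such as NLTK's `PorterStemmer().stem` could be passed as the `stem` argument instead.

```python
import re
from collections import Counter


def word_frequencies(text, stem=None, normalize=False):
    """Return a dict mapping each word in `text` to its frequency."""
    # Lowercase and keep alphabetic runs only (a simplifying assumption).
    words = re.findall(r"[a-z]+", text.lower())
    if stem is not None:
        # Collapse inflected forms onto a shared stem before counting.
        words = [stem(w) for w in words]
    counts = Counter(words)
    if normalize and words:
        # Assumed meaning of "accounting for length": relative frequency.
        total = len(words)
        return {w: c / total for w, c in counts.items()}
    return dict(counts)


def toy_stem(word):
    # Hypothetical stand-in stemmer: strips one trailing "s" from longer words.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


freqs = word_frequencies("Computers compute; the computer computes.", stem=toy_stem)
relative = word_frequencies("a b a b", normalize=True)
```

With stemming, "Computers" and "computer" are counted under one key, which shrinks the vocabulary the classifier has to estimate.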

Procedures: train.py
• Trains the program on which words occur more frequently in each class
• Builds a PFX vector: the probability that each word is in the class
• Multivariate or multinomial model
• Stopwords

Procedures: Multivariate v. Multinomial
• Multivariate
  P(w) = (num files with w + 1) / (num files in class + num vocab)
• Multinomial
  P(w) = (frequency of w + 1) / (num words + num vocab)
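The two estimates can be sketched in a few lines, taking the formulas on this slide literally. The function names and the toy class of two tokenized files are hypothetical, and the vocabulary is assumed to be the set of distinct words in the files given. `Fraction` is used only to keep the results exact.

```python
from fractions import Fraction


def multivariate_estimate(word, docs, vocab):
    """(num files with word + 1) / (num files in class + num vocab)."""
    files_with_word = sum(1 for doc in docs if word in doc)
    return Fraction(files_with_word + 1, len(docs) + len(vocab))


def multinomial_estimate(word, docs, vocab):
    """(frequency of word + 1) / (num words + num vocab)."""
    occurrences = sum(doc.count(word) for doc in docs)
    total_words = sum(len(doc) for doc in docs)
    return Fraction(occurrences + 1, total_words + len(vocab))


# Hypothetical toy class: two files, six words in total, three distinct words.
docs = [["spam", "ham", "spam"], ["ham", "eggs", "ham"]]
vocab = {w for doc in docs for w in doc}

p_multivariate = multivariate_estimate("spam", docs, vocab)  # (1+1)/(2+3)
p_multinomial = multinomial_estimate("spam", docs, vocab)    # (2+1)/(6+3)
```

The multivariate estimate only asks whether a word appears in a file, while the multinomial estimate counts every occurrence, so a word repeated within one file moves the second number but not the first.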

Example
• File 1: Computer, Science, AI, Science
• File 2: AI, Computer, Learning, Parallel
• 2 files, 8 words in total, 5 distinct vocabulary words
• Multivariate for Computer: (2+1)/(2+5) = 3/7
• Multinomial for Computer: (2+1)/(8+5) = 3/13
• Multivariate for Parallel: (1+1)/(2+5) = 2/7
• Multinomial for Parallel: (1+1)/(8+5) = 2/13
• Multivariate for Science: (1+1)/(2+5) = 2/7
• Multinomial for Science: (2+1)/(8+5) = 3/13

Procedures: test.py
• Using the PFX generated by train.py, goes through the test cases and compares the words in each one to those in each class as a whole
• Sums log probabilities instead of multiplying the probabilities, because the product of many small probabilities underflows to zero in floating point
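A short sketch of why the log sum matters and how a score could be computed with it. This is illustrative rather than the project's test.py: the `models` layout (per-class word probabilities, a class prior, and a fallback probability for unseen words) and all the numbers are assumptions.

```python
import math


def log_score(words, word_probs, prior, unseen_prob):
    """Log of prior * product of per-word probabilities, computed as a sum."""
    score = math.log(prior)
    for w in words:
        # Words missing from the class model fall back to an assumed small value.
        score += math.log(word_probs.get(w, unseen_prob))
    return score


def classify(words, models):
    """Return the class whose model gives the highest log score."""
    return max(models, key=lambda c: log_score(words, *models[c]))


# Multiplying 200 probabilities of 0.001 underflows to exactly 0.0 ...
naive_product = math.prod([1e-3] * 200)
# ... while the equivalent sum of logs stays finite and comparable.
log_sum = sum(math.log(1e-3) for _ in range(200))

# Hypothetical two-class model: (word probabilities, prior, unseen probability).
models = {
    "sci": ({"computer": 0.4, "science": 0.4, "ball": 0.2}, 0.5, 0.01),
    "sport": ({"ball": 0.6, "team": 0.3, "computer": 0.1}, 0.5, 0.01),
}
label_a = classify(["computer", "science"], models)
label_b = classify(["ball", "team"], models)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability, so the ranking is unchanged.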

Testing
• Generated text files based on set probabilities of each word occurring
• Compared the initial, programmed-in probabilities to the generated PFX
• Also used the generated files to test text classification
• Script for quicker testing
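The generate-then-compare check described above might look roughly like this. It is a sketch under assumptions: the word probabilities, the sample size, and the helper names are made up for illustration, and the estimate here is a plain relative frequency rather than the project's PFX.

```python
import random
from collections import Counter


def generate_words(word_probs, n_words, rng):
    """Draw n_words words according to the programmed-in probabilities."""
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=n_words)


def estimate_probs(words):
    """Relative frequency of each word in the generated text."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}


# Hypothetical programmed-in distribution for one class.
true_probs = {"computer": 0.5, "science": 0.3, "learning": 0.2}
rng = random.Random(0)  # fixed seed so the check is repeatable
sample = generate_words(true_probs, 20000, rng)
estimated = estimate_probs(sample)

# Largest gap between the programmed-in and the recovered probabilities.
max_error = max(abs(estimated.get(w, 0.0) - p) for w, p in true_probs.items())
```

If the training code is correct, the recovered probabilities should approach the programmed-in ones as the generated files grow, so a large `max_error` points to a bug rather than to noise.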

Results: Effect of stemming

Results: Multivariate v. Multinomial

Results: Accounting for Length

Results: Stopwords

Conclusions
• Effect of optimizations
• Questions?
