Naïve Bayes Classification Christina Wallin Computer Systems Research Lab 2008-2009
Goal • create a naïve Bayes classifier using the 20 Newsgroup database • compare the effectiveness of different implementations of this method
What is Naïve Bayes? • Bayes’ Theorem: a classification method based on an independence assumption • “Mars Rover” • Machine Learning
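For reference, the standard form of the rule this slide alludes to can be written as follows. The symbols $c$ (a class) and $w_1, \dots, w_n$ (the words of a file) are notation introduced here, not taken from the slides. The second expression follows from the first by assuming the words are independent given the class and dropping the denominator, which is the same for every class.

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
\quad\Longrightarrow\quad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```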
Program Overview • Python with NLTK (Natural Language Toolkit) • file.py • train.py • test.py
Procedures: file.py • Parses a file and builds a dictionary of every word present and its frequency • Stems words • Accounts for length
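The parsing step might look roughly like the sketch below. It is not the project's file.py: the function names are hypothetical, `crude_stem` is a toy stand-in for a real stemmer (the project used NLTK, which is not imported here so the block stays self-contained), and dividing by the total word count is only one possible reading of "accounting for length".

```python
import re
from collections import Counter


def crude_stem(word):
    """Toy stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(text, stem=crude_stem):
    """Map each lowercased, stemmed word in the text to its count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(token) for token in tokens)


def relative_frequencies(counts):
    """Divide by the total word count so long and short files are comparable."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


counts = word_frequencies("Learning computers learned; the computer learns.")
print(counts)
print(relative_frequencies(counts))
```

Passing the stemmer in as a parameter keeps the counting logic independent of whichever stemmer is actually used.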
Procedures: train.py • Trains the program on which words occur more frequently in each class • Builds a PFX vector: the probability of each word given the class • Multivariate or Multinomial • Stopwords
Procedures: Multivariate v. Multinomial • Multivariate • P(w) = ((num files with w) + 1) / (num files in class + num vocab) • Multinomial • P(w) = ((frequency of w) + 1) / (num words in class + num vocab)
Example (both files in one class; vocabulary of 5 distinct words) • File 1: Computer, Science, AI, Science • File 2: AI, Computer, Learning, Parallel • Multivariate for Computer: (2+1)/(2+5) = 3/7 • Multinomial for Computer: (2+1)/(8+5) = 3/13 • Multivariate for Parallel: (1+1)/(2+5) = 2/7 • Multinomial for Parallel: (1+1)/(8+5) = 2/13 • Multivariate for Science: (1+1)/(2+5) = 2/7 • Multinomial for Science: (2+1)/(8+5) = 3/13
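A minimal sketch of both estimates on the two example files, following the formulas on the Multivariate v. Multinomial slide. It assumes both files belong to the same class and that "num vocab" is the number of distinct words across them (5). The function and variable names are illustrative, not taken from train.py.

```python
from collections import Counter
from fractions import Fraction

files = [
    ["Computer", "Science", "AI", "Science"],
    ["AI", "Computer", "Learning", "Parallel"],
]

num_vocab = len({w for f in files for w in f})        # 5 distinct words
num_files = len(files)                                # 2 files in the class
num_words = sum(len(f) for f in files)                # 8 words in the class

doc_freq = Counter(w for f in files for w in set(f))  # files containing w
term_freq = Counter(w for f in files for w in f)      # occurrences of w


def multivariate(word):
    """(num files with w + 1) / (num files in class + num vocab)."""
    return Fraction(doc_freq[word] + 1, num_files + num_vocab)


def multinomial(word):
    """(frequency of w + 1) / (num words + num vocab)."""
    return Fraction(term_freq[word] + 1, num_words + num_vocab)


for w in ("Computer", "Parallel", "Science"):
    print(w, multivariate(w), multinomial(w))
# Computer 3/7 3/13
# Parallel 2/7 2/13
# Science 2/7 3/13
```

The two models disagree on Science and Parallel: the multivariate estimate only sees that each appears in one file, while the multinomial estimate also counts that Science occurs twice.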
Procedures: test.py • Using the PFX generated by train.py, go through the test cases and compare their words to those of each class as a whole • Sum the logs of the word probabilities rather than multiplying them, because the product of many small probabilities underflows to zero in floating point
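The log-sum step can be sketched as below. The PFX values, the class names, and the `unseen` fallback probability for words missing from a class's PFX are all made-up assumptions for illustration. The project's test.py may handle unseen words and class priors differently.

```python
import math


def log_score(words, pfx, log_prior=0.0, unseen=1e-6):
    """Sum of log word probabilities for one class (higher is better)."""
    return log_prior + sum(math.log(pfx.get(w, unseen)) for w in words)


def classify(words, pfx_by_class):
    """Return the class whose PFX gives the words the highest log score."""
    return max(pfx_by_class, key=lambda c: log_score(words, pfx_by_class[c]))


# Hypothetical PFX vectors for two classes.
pfx_by_class = {
    "sci": {"computer": 0.3, "science": 0.3, "hockey": 0.01},
    "rec": {"computer": 0.02, "science": 0.02, "hockey": 0.4},
}
print(classify(["computer", "science", "computer"], pfx_by_class))  # sci

# Why not multiply: 100 probabilities of 1e-5 underflow to exactly 0.0,
# while the sum of their logs is an ordinary float.
tiny = [1e-5] * 100
product = 1.0
for p in tiny:
    product *= p
log_sum = sum(math.log(p) for p in tiny)
print(product, log_sum)
```

Because the logarithm is monotonic, the class with the largest log sum is also the class with the largest product, so the ranking of classes is unchanged.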
Testing • Generated text files from known word probabilities • Compared the initial, programmed-in probabilities to the generated PFX • Also used the generated files to test text classification • Script for quicker testing
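This kind of check could be sketched as follows: draw words from a known distribution, re-estimate the probabilities with the multinomial formula, and compare. The distribution, file sizes, seed, and function names are arbitrary choices for illustration and are not taken from the project's test script.

```python
import random
from collections import Counter


def generate_file(word_probs, length, rng):
    """Draw `length` words independently according to word_probs."""
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=length)


def estimate_pfx(files, vocab):
    """Multinomial estimate with add-one smoothing over the given vocab."""
    counts = Counter(w for f in files for w in f)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}


rng = random.Random(0)  # fixed seed so runs are repeatable
true_probs = {"computer": 0.5, "science": 0.3, "parallel": 0.2}
files = [generate_file(true_probs, 200, rng) for _ in range(50)]

pfx = estimate_pfx(files, true_probs)
max_error = max(abs(pfx[w] - true_probs[w]) for w in true_probs)
print(pfx)
print("largest gap from the programmed-in probabilities:", max_error)
```

With 10,000 generated words the estimates should land close to the programmed-in values, so a large gap would point to a bug in the training code.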
Conclusions • Effect of optimizations • Questions?