Naïve Bayes Classifier

Naïve Bayes Classifier. Christina Wallin, Period 3, Computer Systems Research Lab, 2008-2009. Goal: create a naïve Bayes classifier using the 20 Newsgroups data set and compare the effectiveness of a multivariate Bayes classifier and a multinomial Bayes classifier, with optimizations.


Presentation Transcript


  1. Naïve Bayes Classifier Christina Wallin, Period 3 Computer Systems Research Lab 2008-2009

  2. Goal • Create a naïve Bayes classifier using the 20 Newsgroups data set • Compare the effectiveness of a multivariate Bayes classifier and a multinomial Bayes classifier, with optimizations

  3. What is Naïve Bayes? -Classification method based on an independence assumption: the words in a text are treated as independent of one another given the class -Machine learning -trained on labeled example texts that show what the classes are, and can then classify new texts -classification is based on the probability that a word appears in a specific class of text

  4. Previous Research • The algorithm has been around for a while (first use was in 1966) • At first it was thought to be less effective because of its simplicity and its false independence assumption, but a recent review of the uses of the algorithm found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu)

  5. Previous Research Cont’d • Currently, the best methods use a combination of naïve Bayes and logistic regression (Shen and Jiang, 2003) • There is still room for improvement: data selection for training and how to incorporate text length (Lewis, 2001) • My program will investigate which features of the training data make it better suited to naïve Bayes, building on the basic structure outlined in many papers

  6. Program Overview • Python with NLTK (Natural Language Toolkit) • file.py • train.py • test.py

  7. Procedures: file.py • So far, a program that reads in a text file • Parses the file • Builds a dictionary of all the words present and their frequencies • Can optionally stem the words • With PyLab, can graph the 20 most frequent words
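The file.py steps listed above could be sketched roughly as follows. This is a minimal illustration, not the author's code: the function name `word_frequencies`, the regular-expression tokenizer, and the `stem` parameter are assumptions. Any callable that maps a word to its stem can be passed in, for example the `stem` method of NLTK's `PorterStemmer`; the toy `strip_plural` function is only a stand-in so the sketch runs without NLTK.

```python
import re
from collections import Counter


def word_frequencies(text, stem=None):
    """Return a Counter mapping each word in `text` to its frequency.

    `stem` is an optional callable (a hypothetical parameter) applied to
    every lowercased token, e.g. nltk.stem.PorterStemmer().stem.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return Counter(tokens)


def strip_plural(word):
    """Toy stand-in for a real stemmer: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


sample = "Cats chase dogs. The dog chases the cat."
raw_counts = word_frequencies(sample)
stemmed_counts = word_frequencies(sample, stem=strip_plural)

# The (up to) 20 most frequent words, which is what a bar chart would plot.
top_words = stemmed_counts.most_common(20)
```

With stemming switched on, "cats" and "cat" are counted as the same word, which is the effect the stemming option is meant to provide.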

  8. Procedures: train.py • Trains the program on which words occur more frequently in each class • Builds a PFX vector: for each word, the probability that it occurs in the class • Either multivariate or multinomial

  9. Procedures: Multivariate v. Multinomial • Multivariate • PFX(w) = (num files in class containing w + 1) / (num files in class + num vocab) • Multinomial • PFX(w) = (frequency of w in class + 1) / (num words in class + num vocab)
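The two PFX formulas on this slide can be compared side by side in a short sketch. The function names, the toy documents, and the vocabulary below are hypothetical; only the formulas come from the slide. The +1 in each numerator is add-one smoothing, so a word never seen in a class still gets a nonzero probability. Note that many textbook treatments of the multivariate (Bernoulli) model add 2 rather than the vocabulary size in the denominator; the sketch follows the slide as written.

```python
from collections import Counter


def train_multivariate(class_docs, vocab):
    """PFX(w) = (files in class containing w + 1) / (files in class + |vocab|)."""
    doc_counts = Counter()
    for doc in class_docs:
        doc_counts.update(set(doc))  # count each word at most once per file
    denom = len(class_docs) + len(vocab)
    return {w: (doc_counts[w] + 1) / denom for w in vocab}


def train_multinomial(class_docs, vocab):
    """PFX(w) = (frequency of w in class + 1) / (words in class + |vocab|)."""
    word_counts = Counter()
    for doc in class_docs:
        word_counts.update(doc)  # count every occurrence
    denom = sum(word_counts.values()) + len(vocab)
    return {w: (word_counts[w] + 1) / denom for w in vocab}


# Hypothetical toy class: three tokenized "files".
docs = [["ball", "game", "ball"], ["game", "team"], ["ball"]]
vocab = {"ball", "game", "team", "vote"}

pfx_mv = train_multivariate(docs, vocab)  # ball: (2 + 1) / (3 + 4)
pfx_mn = train_multinomial(docs, vocab)   # ball: (3 + 1) / (6 + 4)
```

The only difference is what gets counted: the multivariate version asks whether a word appears in a file at all, while the multinomial version counts every occurrence.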

  10. Example

  11. Procedures: test.py • Using the PFX generated by train.py, go through the testing cases and compare the words in each one to those in the classes as a whole • Use a sum of logs to compute the probability, because multiplying many small probabilities together would underflow to zero in floating point
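A minimal sketch of the log-sum scoring described for test.py is given here. The function names and the two PFX tables are hypothetical, and the sketch makes some simplifying assumptions: equal class priors by default, words missing from a table are skipped, and scoring follows the multinomial form (one term per word occurrence). A multivariate version would also need a term for vocabulary words that are absent from the text.

```python
import math


def log_score(words, pfx, log_prior=0.0):
    """Sum of log probabilities of `words` under one class's PFX table.

    Words missing from the table are skipped (an assumption of this sketch).
    """
    return log_prior + sum(math.log(pfx[w]) for w in words if w in pfx)


def classify(words, pfx_by_class):
    """Return the class name with the highest log score."""
    return max(pfx_by_class, key=lambda c: log_score(words, pfx_by_class[c]))


# Hypothetical PFX tables for two classes.
pfx_by_class = {
    "sports": {"ball": 0.5, "game": 0.3, "vote": 0.2},
    "politics": {"ball": 0.1, "game": 0.2, "vote": 0.7},
}

label = classify(["vote", "vote", "game"], pfx_by_class)

# Why logs: the direct product of many small probabilities underflows,
# while the equivalent sum of logs stays finite.
direct_product = math.prod([1e-5] * 100)  # requires Python 3.8+
log_sum = sum(math.log(1e-5) for _ in range(100))
```

Because the logarithm is monotonic, the class with the largest log sum is also the class with the largest product, so the ranking of classes is unchanged.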

  12. Testing • Generated text files from preset probabilities of each word occurring • Compared the initial, programmed-in probabilities to the PFX values generated by training • Also used the generated files to test text classification
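The testing approach on this slide, generating text from known word probabilities and checking that training recovers them, might look like the sketch below. The distribution, the sample size, the seed, and the function name `generate_words` are all hypothetical; the re-estimation step uses the multinomial PFX formula from the slides.

```python
import random
from collections import Counter


def generate_words(word_probs, n_words, rng):
    """Draw `n_words` words independently according to `word_probs`."""
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=n_words)


# Hypothetical "programmed-in" distribution for one class.
true_probs = {"ball": 0.5, "game": 0.3, "team": 0.2}

rng = random.Random(0)  # fixed seed so the run is repeatable
generated = generate_words(true_probs, 20000, rng)

# Re-estimate with the multinomial formula: (count + 1) / (words + vocab).
counts = Counter(generated)
denom = len(generated) + len(true_probs)
estimated = {w: (counts[w] + 1) / denom for w in true_probs}

# Largest gap between the programmed-in and the recovered probabilities.
max_error = max(abs(estimated[w] - true_probs[w]) for w in true_probs)
```

If training is implemented correctly, `max_error` should shrink as the number of generated words grows; a gap that stays large would point to a bug in the counting or smoothing.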

  13. Results: Effect of stemming

  14. Results: Multivariate v. Multinomial

  15. Still to come • Improve the multinomial version • Optimization • Take into account the number of words in files • Analyze the data from 20 Newsgroups to see why certain classes can be classified more easily than others
