
Naïve Bayes Classifier

Christina Wallin, Period 3, Computer Systems Research Lab, 2008-2009. Goal: create a naïve Bayes classifier using the 20 Newsgroups data set, and compare the effectiveness of a simple naïve Bayes classifier with an optimized one.


Presentation Transcript


  1. Naïve Bayes Classifier • Christina Wallin, Period 3 • Computer Systems Research Lab, 2008-2009

  2. Goal • Create a naïve Bayes classifier using the 20 Newsgroups data set • Compare the effectiveness of a simple naïve Bayes classifier with an optimized one

  3. What is Naïve Bayes? • A classification method based on an independence assumption • A machine learning method: it is trained on example texts whose classes are already known, and can then classify new texts • Classification is based on the probability that a word will appear in a specific class of text
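The independence assumption on this slide can be written out as a formula. This is the standard textbook form, not taken from the slides; the symbols $c$ (a class) and $w_1, \dots, w_n$ (the words of a text) are notation introduced here for illustration.

```latex
% Bayes' rule, with the denominator dropped because it is the same for every class:
P(c \mid w_1, \dots, w_n) \;\propto\; P(c)\, P(w_1, \dots, w_n \mid c)

% The "naive" step: assume the words are independent of each other given the class:
P(w_1, \dots, w_n \mid c) \;\approx\; \prod_{i=1}^{n} P(w_i \mid c)

% The classifier picks the class with the largest resulting score:
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```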

  4. Previous Research • The algorithm has been around for a while (first use was in 1966) • At first, it was thought to be less effective because of its simplicity and its false independence assumption, but a more recent review of its uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu)

  5. Previous Research, Cont'd • Currently, the best methods use a combination of naïve Bayes and logistic regression (Shen and Jiang, 2003) • There is still room for improvement: data selection for training and how to incorporate the text length (Lewis, 2001) • My program will investigate which features of the training data make it better suited to naïve Bayes, building upon the basic structure outlined in many papers

  6. Program Overview • Python with NLTK (Natural Language Toolkit) • file.py • train.py • test.py

  7. Procedures: file.py • So far, a program which takes a text file as input • Parses the file • Makes a dictionary of all of the words present and their frequencies • Can choose whether or not to stem words • With PyLab, can graph the 20 most frequent words
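The word-counting step described for file.py can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names `word_frequencies` and `crude_stem`, the tokenizing regular expression, and the sample sentence are all assumptions made here. `crude_stem` is only a stand-in so the sketch runs without NLTK.

```python
import re
from collections import Counter

def word_frequencies(text, stem=None):
    """Count how often each word occurs in text, optionally stemming first."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if stem is not None:
        words = [stem(w) for w in words]
    return Counter(words)

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

sample = "The rocket launches. The rockets launch, and the rocket lands."
freqs = word_frequencies(sample)                     # no stemming
stemmed = word_frequencies(sample, stem=crude_stem)  # with stemming

# The most frequent words, in the spirit of the "20 most frequent" graph.
top_words = freqs.most_common(3)
```

With NLTK installed, a Porter stemmer could be passed instead, for example `stem=nltk.stem.PorterStemmer().stem`. Passing the stemmer as a parameter keeps the "stem or not" choice in one place.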

  8. Procedures: train.py • Trains the program on which words occur more frequently in each class • Builds a PFX vector: for each word, the probability that it appears in a text of the class • Computed as (number of texts in the class that contain the word) / (total number of texts in the class) • Applies Laplace smoothing, so that a word never seen in a class does not get a probability of zero
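The PFX computation with Laplace smoothing might look like the sketch below. It is an illustration under stated assumptions, not the author's train.py: the function name `train`, the representation of each text as a set of words, and the toy training data are all invented here. The smoothing is the usual add-one form for a present/absent model.

```python
def train(docs_by_class):
    """Return {class: {word: P(word appears in a text of that class)}}."""
    # Vocabulary: every word seen in any class.
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    pfx = {}
    for label, docs in docs_by_class.items():
        n_docs = len(docs)
        pfx[label] = {}
        for word in vocab:
            n_with_word = sum(1 for doc in docs if word in doc)
            # Laplace smoothing: add 1 to the count, and 2 to the total
            # (one for each outcome, "present" and "absent").
            pfx[label][word] = (n_with_word + 1) / (n_docs + 2)
    return pfx

# Hypothetical toy data: each text is the set of words it contains.
training = {
    "space": [{"orbit", "launch"}, {"orbit", "moon"}],
    "baseball": [{"pitcher", "inning"}, {"pitcher", "launch"}],
}
pfx = train(training)
```

Here "pitcher" never occurs in the "space" texts, yet its smoothed probability is 1/4 rather than 0, so a single unseen word cannot rule out a class.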

  9. Procedures: test.py • Using the PFX generated by train.py, goes through the testing cases and compares the words in them to those in each class as a whole • Sums the logarithms of the probabilities instead of multiplying the probabilities directly, because the product of many small probabilities underflows to zero in floating point
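The log-sum scoring can be sketched as below. The assumptions are mine rather than the slides': equal prior probability for every class, a present/absent (Bernoulli) word model, and hard-coded PFX values chosen for illustration. The name `classify` is hypothetical.

```python
import math

def classify(doc_words, pfx):
    """Pick the class with the highest log-probability score (equal priors assumed)."""
    scores = {}
    for label, probs in pfx.items():
        score = 0.0
        for word, p in probs.items():
            # Present words contribute log p; absent words contribute log(1 - p).
            score += math.log(p) if word in doc_words else math.log(1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical PFX values, already smoothed so none is exactly 0 or 1.
pfx = {
    "space": {"orbit": 0.75, "launch": 0.5, "pitcher": 0.25},
    "baseball": {"orbit": 0.25, "launch": 0.5, "pitcher": 0.75},
}
label, scores = classify({"orbit", "launch"}, pfx)

# Why logs: multiplying 500 probabilities of 0.01 underflows to 0.0,
# while the equivalent sum of logs stays an ordinary finite number.
tiny_product = 1.0
for _ in range(500):
    tiny_product *= 0.01
log_total = 500 * math.log(0.01)
```

Because the logarithm is monotonic, comparing the summed logs ranks the classes in the same order as comparing the products would.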

  10. Testing • Generated text files in which each word occurs with a known, programmed-in probability • Compared those initial probabilities to the PFX that train.py generated • Also used the generated files to test text classification
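A test of this kind could be sketched as follows: generate synthetic texts from known word probabilities, then check that the estimated probabilities come out close to them. The word list, the probabilities, the number of texts, and the name `generate_doc` are all assumptions made for illustration.

```python
import random

def generate_doc(word_probs, rng):
    """Build one synthetic text: include each word with its given probability."""
    return {w for w, p in word_probs.items() if rng.random() < p}

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "programmed-in" probabilities for one class.
true_probs = {"orbit": 0.8, "moon": 0.3, "pitcher": 0.05}
docs = [generate_doc(true_probs, rng) for _ in range(5000)]

# Estimate each probability as the fraction of texts containing the word.
estimated = {w: sum(1 for d in docs if w in d) / len(docs) for w in true_probs}

# Largest gap between the programmed-in and the estimated probabilities.
max_error = max(abs(estimated[w] - true_probs[w]) for w in true_probs)
```

If the training code is correct, `max_error` shrinks as the number of generated texts grows.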

  11. Results: file.py • [Figure: 20 most frequent words in sci.space, from 20 Newsgroups] • [Figure: 20 most frequent words in rec.sport.baseball, from 20 Newsgroups]

  12. Results: file.py • The posts are approximately the same length • sci.space posts are denser and less to the point • The most frequent word, 'the', is the same in both

  13. Results: Effect of stemming • alt.atheism and rec.autos: 82.6% correctly classified with the stemmer vs. 83.6% without • comp.sys.ibm.pc.hardware and comp.sys.mac.hardware: 66.6% vs. 67.7% • sci.crypt and alt.atheism: 69.3% vs. 70.4% • I expected stemming to help, but as shown, using a Porter stemmer to stem words before generating the probability vector does not help

  14. Still to come • Optimization • Analyze the data from 20 Newsgroups to see why certain classes can be classified more easily than others • Change to a multinomial model, which takes into account multiple occurrences of a word in a file
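For comparison with the present/absent model, a multinomial estimate counts every occurrence of a word rather than the number of texts that contain it. The sketch below shows one common way to do this with add-one smoothing. It is not the author's planned implementation, and the name `train_multinomial` and the toy data are hypothetical.

```python
from collections import Counter

def train_multinomial(docs, vocab):
    """Estimate P(word | class) from total word counts, with add-one smoothing."""
    counts = Counter()
    for doc in docs:
        # Each text is a list of words, so repeated words are counted each time.
        counts.update(w for w in doc if w in vocab)
    total = sum(counts.values())
    # Add 1 per word in the numerator and the vocabulary size in the denominator.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

vocab = {"orbit", "moon", "pitcher"}
docs = [["orbit", "orbit", "moon"], ["orbit", "moon", "moon"]]
probs = train_multinomial(docs, vocab)
```

In this model the probabilities over the vocabulary sum to 1, and a word that appears three times in a text contributes three times to the score.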
