
Arron La, Joey Lei, David Cortez



Presentation Transcript


  1. Spam Filtering Team: Arron La, Joey Lei, David Cortez

  2. Problem • How to differentiate emails • Decide if an email is spam or non-spam • Gather a diverse knowledge base to develop an unbiased spam filter

  3. Techniques for Implementation • A hash table with “nearest neighbor approach” • Nearest neighbor approach with extra data • Bayesian or Neural Networks

  4. “Nearest Neighbor Approach” • The hash table will contain important and common words that may indicate whether an email is spam • [Figure labels: Non-Spam, E-Mail, Spam]
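The nearest neighbor idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the indicator words, the training emails, the squared Euclidean distance, and the choice of k = 3 are all hypothetical, since the slides do not specify them.

```python
from collections import Counter

# Hypothetical indicator words; the slides do not list the real table contents.
INDICATOR_WORDS = {"free", "winner", "money", "meeting", "report", "click"}

def featurize(text):
    """Build a hash table (dict) of indicator-word counts for one email."""
    counts = Counter(w.strip(".,!?:;").lower() for w in text.split())
    return {w: counts[w] for w in INDICATOR_WORDS if counts[w]}

def distance(a, b):
    """Squared Euclidean distance between two sparse count tables."""
    return sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in set(a) | set(b))

def classify(text, training, k=3):
    """Label an email by majority vote among its k nearest training emails."""
    target = featurize(text)
    nearest = sorted(training, key=lambda item: distance(target, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up training set, for illustration only.
training = [
    (featurize("Click now, free money for every winner"), "spam"),
    (featurize("You are a winner! Click to claim free prize"), "spam"),
    (featurize("Free money, click here"), "spam"),
    (featurize("Agenda for the meeting and the quarterly report"), "non-spam"),
    (featurize("Please send the report before the meeting"), "non-spam"),
    (featurize("Meeting moved to Friday"), "non-spam"),
]

print(classify("Claim your free money, click today", training))
print(classify("Notes from the meeting are in the report", training))
```

An odd k avoids ties when there are only two labels.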

  5. Nearest Neighbor Approach with Extra Data • The extra data consists of the following: • Size of the email • Content/subject line • Punctuation-to-word ratio • IP addresses
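One way the extra data could be extracted is sketched below. The function name `extra_features`, the feature names, and the IPv4 regular expression are assumptions made for illustration; the slides do not say how each item is measured.

```python
import re
import string

# Loose IPv4 pattern (assumption: addresses are taken from the raw headers).
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extra_features(headers, subject, body):
    """Compute the extra data items listed on the slide for one email."""
    words = body.split()
    punctuation = sum(1 for ch in body if ch in string.punctuation)
    return {
        "size": len(body.encode("utf-8")),          # size of the email body in bytes
        "subject_length": len(subject),             # simple subject-line feature
        "punct_to_word_ratio": punctuation / len(words) if words else 0.0,
        "ip_addresses": IPV4_PATTERN.findall(headers),
    }

features = extra_features(
    headers="Received: from mail.example.com [192.0.2.10]",
    subject="Limited offer",
    body="Buy now!!! Limited offer!!!",
)
print(features)
```

These values could be appended to the word counts before computing neighbor distances, ideally after scaling so that no single feature dominates.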

  6. Bayesian Network Approach • Create two hash tables that tally the number of occurrences of each word in spam and non-spam emails • Create a third hash table that stores the spam probability of each word, computed as:

     probability(word) {
         let g = 2 * hashNonSpam(word)
         let b = hashSpam(word)
         if (g + b) > 5 then
             max(0.01, min(0.99, min(1, b / numOfSpam) / (min(1, g / numOfNonSpam) + min(1, b / numOfSpam))))
     }

     where numOfSpam = number of spam emails and numOfNonSpam = number of non-spam emails
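The per-word probability can be written as a short runnable sketch. The function and parameter names are illustrative, and the counts in the example are made up. The clamp bounds default to 0.01 and 0.99, the values used in Paul Graham's "A Plan for Spam"; they are parameters here so other bounds can be tried.

```python
def word_probability(word, spam_counts, nonspam_counts, num_spam, num_nonspam,
                     min_count=5, low=0.01, high=0.99):
    """Estimate how strongly a word indicates spam, or None if it is too rare."""
    g = 2 * nonspam_counts.get(word, 0)   # non-spam count, doubled to reduce false positives
    b = spam_counts.get(word, 0)          # spam count
    if g + b <= min_count:
        return None                       # not enough evidence for this word
    spam_freq = min(1.0, b / num_spam)
    nonspam_freq = min(1.0, g / num_nonspam)
    return max(low, min(high, spam_freq / (nonspam_freq + spam_freq)))

# Made-up tallies: the two hash tables of word occurrences.
spam_counts = {"free": 8, "meeting": 1, "lottery": 7}
nonspam_counts = {"free": 1, "meeting": 6}

# Third hash table: probability of each sufficiently common word.
probabilities = {}
for word in set(spam_counts) | set(nonspam_counts):
    p = word_probability(word, spam_counts, nonspam_counts, num_spam=10, num_nonspam=10)
    if p is not None:
        probabilities[word] = p

print(probabilities)
```

A word seen only in spam, such as "lottery" in this example, would score 1.0 before clamping, so the upper bound keeps a single word from deciding the result on its own.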

  7. Bayesian Network Approach (continued) • To check an email: take the 20 words whose probabilities are farthest from 0.5 (0.5 being the value of a neutral word) • Combine the probabilities a, b, ..., v of those words using Bayes' rule: $\mathrm{prob}(\mathrm{email}) = \frac{ab \cdots v}{ab \cdots v + (1 - a)(1 - b) \cdots (1 - v)}$ • If prob(email) > 0.9, classify the email as spam
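The checking step can be sketched as below, assuming the per-word probabilities have already been computed and clamped away from 0 and 1. The function names are illustrative; the 20-word limit and the 0.9 threshold come from the slide.

```python
import math

def most_interesting(probabilities, n=20):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def combined_probability(probabilities):
    """Combine word probabilities with the product form of Bayes' rule."""
    spam = math.prod(probabilities)
    nonspam = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + nonspam)

def is_spam(word_probabilities, threshold=0.9, n=20):
    """Classify an email from the probabilities of the words it contains."""
    return combined_probability(most_interesting(word_probabilities, n)) > threshold

print(is_spam([0.99, 0.95, 0.2, 0.5]))   # strong spam words outweigh one mild non-spam word
print(is_spam([0.01, 0.1, 0.6]))         # strong non-spam words dominate
```

Because the products shrink quickly, a sum of logarithms would be a safer choice if many more than 20 words were combined.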

  8. Methods of Evaluation • Create training and testing data sets to determine effectiveness • Use the results to compare the implementations with one another • Compare the implementations with other well-known techniques
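A possible evaluation harness is sketched below. The 75/25 split, the choice of metrics, and the keyword classifier used as a stand-in are assumptions for illustration; the slides do not name specific metrics or data sets.

```python
import random

def split_dataset(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled emails and split them into training and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def evaluate(classifier, test_set):
    """Score a classifier (text -> 'spam' or 'non-spam') on labeled emails."""
    tp = fp = tn = fn = 0
    for text, label in test_set:
        predicted = classifier(text)
        if predicted == "spam":
            if label == "spam":
                tp += 1
            else:
                fp += 1   # legitimate email wrongly flagged
        else:
            if label == "spam":
                fn += 1   # spam that slipped through
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "spam_recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Stand-in classifier, for illustration only.
def keyword_classifier(text):
    return "spam" if "free" in text.lower() else "non-spam"

test_set = [
    ("Free money", "spam"),
    ("Free lunch at the meeting", "non-spam"),
    ("Quarterly report", "non-spam"),
    ("Win a prize", "spam"),
]
print(evaluate(keyword_classifier, test_set))
```

Running the same `evaluate` function over each implementation on the same testing set gives directly comparable numbers. The false positive rate is reported separately because flagging a legitimate email is usually costlier than missing a spam message.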

  9. Other Techniques of Implementation • Blacklist Domains/Emails • “White list” Domains • Authenticity Checking • Header/Context Analysis • Checksum Technology • User Input Learning (Spam/Non-Spam Button) • Classifying Non-Spam

  10. References • Paul Graham, “A Plan for Spam,” August 2002, www.paulgraham.com/spam • Paul Graham, “Better Bayesian Filtering,” 2003 Spam Conference, www.paulgraham.com/better
