100 likes | 199 Views
Spam Filtering Team. Arron La Joey Lei David Cortez. Problem. How to differentiate emails Decide if an email is spam or non-spam Gather a diverse knowledge base to develop an unbiased spam filter. Techniques for Implementation. A hash table with “nearest neighbor approach”
E N D
Spam Filtering Team Arron La Joey Lei David Cortez
Problem • How to differentiate emails • Decide if an email is spam or non-spam • Gather a diverse knowledge base to develop an unbiased spam filter
Techniques for Implementation • A hash table with “nearest neighbor approach” • Nearest neighbor approach with extra data • Bayesian or Neural Networks
The hash table will contain important and common words that may indicate if an email is spam “Nearest Neighbor Approach” Non-Spam E-Mail Spam
Nearest Neighbor Approach with Extra Data • Extra Data Consists are the following: • Size of the email • Content\Subject Line • Punctuation to word ratios • IP addresses
Bayesian Network Approach • Create two hash tables that tallies the number of occurrences of each word in a spam/non-spam email • Create a third hash table that calculates the probability of each word • probability(word) { let g = (2 * # of hashNonSpam(word)) let b = (# of hashSpam(word)) if(g + b) > 5 then max( 0.1, (min 0.99, ((min (numOfSpam / b), 1) / ((min (g/ numOfNonSpam, 1) + min(1, (b/ numOfSpam))) } numOfSpam = # of spam emails numOfNonSpam = # of non-spam emails
Bayesian Network Approach Continue.. • To check email: Take 20 words that has the probability farthest from 0.5 (meaning neutral words) • With those 20 words, use Bayes Rule ab..v prob(word) = ------------------------------ ab..v + (1 - a)(1 - b)..(1-v) If prob(word) > 0.9 == SPAM EMAIL
Methods of Evaluation • Create a training and testing data set to determine effectiveness • Results to compare implementations to one another • Implementations can be compared to other well-known techniques
Blacklist Domains/Emails “White list” Domains Authenticity Checking Header/Context Analysis Checksum Technology User Input Learning (Spam/Non-Spam Button) Classifying Non-Spam Other Techniques of Implementation
Reference • “A Plan for Spam,” Paul Graham, 2003 August, www.paulgraham.com/spam • “Better Bayesian Filtering,” 2003 Spam Conference, www.paulgraham.com/better