Bayesian Spam Filters • Key Concepts • Conditional Probability • Independence • Bayes' Theorem
Spam or Ham? FROM: Terry Delaney [removed] TO: (removed) Subject: FDA approved on-line pharmacies! click here (removed) Chose your product and site below: Canadian pharmacy (removed) - Cialis Soft Tabs - $5.78, Viagra Professional - $4.07, Soma - $1.38, Human Growth Hormone - $43.37, Meridia - $3.32, Tramadol - $2.17, Levitra - $11.97.
Quick Reminders • Conditional Probability: For events E, F with P(F) > 0, P(E | F) = P(E ∩ F) / P(F) • Independence: E and F are independent if and only if P(E ∩ F) = P(E) P(F)
Applying Bayes' Theorem • Let our sample space be the set of emails. • Let S be the event a message is spam; hence S^C is the event a message is not spam. • Let E be the event a message contains a word w.
Spam based on single words? • Probabilities based on single words: a bad idea • False positives and false negatives aplenty • Instead, calculate based on n words, assuming the events E_i | S (and likewise E_i | S^C) are independent and that P(S) = P(S^C).
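The formulas these slides rely on can be written out as follows. This is a reconstruction from the definitions of S, S^C, and E given above and from the stated assumptions (independence of the E_i given S and given S^C, and P(S) = P(S^C)); it is a sketch of the standard derivation, not a copy of the original slide.

```latex
% Bayes' Theorem for a single word w, with E = "message contains w"
\[
P(S \mid E) = \frac{P(E \mid S)\,P(S)}{P(E \mid S)\,P(S) + P(E \mid S^{C})\,P(S^{C})}
\]

% Assuming P(S) = P(S^C), the priors cancel
\[
P(S \mid E) = \frac{P(E \mid S)}{P(E \mid S) + P(E \mid S^{C})}
\]

% For n words, assuming the E_i are independent given S and given S^C
\[
P\Bigl(S \Bigm| \bigcap_{i=1}^{n} E_i\Bigr)
  = \frac{\prod_{i=1}^{n} P(E_i \mid S)}
         {\prod_{i=1}^{n} P(E_i \mid S) + \prod_{i=1}^{n} P(E_i \mid S^{C})}
\]
```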
How do we use this? • The user must train the filter on messages in their inbox to estimate the probabilities • The program or user must define a threshold probability r • If the estimated probability that a message is spam exceeds r, the message is considered spam.
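As a rough illustration of the training step, the sketch below estimates how often a word appears in spam and in non-spam from messages the user has already labeled. The function names (`contains_word`, `estimate_rates`) and the whitespace tokenization are assumptions made for this example; a real filter would tokenize more carefully.

```python
from typing import Sequence, Tuple


def contains_word(message: str, word: str) -> bool:
    """Return True if `word` is one of the whitespace-separated tokens.

    Assumption: case-insensitive matching on plain whitespace tokens,
    so punctuation attached to a word ("Viagra!") will not match.
    """
    return word.lower() in message.lower().split()


def estimate_rates(
    spam: Sequence[str], ham: Sequence[str], word: str
) -> Tuple[float, float]:
    """Estimate the fraction of spam and of non-spam messages containing `word`.

    Both training sets are assumed to be non-empty.
    """
    p = sum(contains_word(m, word) for m in spam) / len(spam)
    q = sum(contains_word(m, word) for m in ham) / len(ham)
    return p, q


# Hypothetical, hand-labeled training messages
spam_messages = ["cheap viagra now", "viagra deals", "hello friend", "win money"]
ham_messages = ["meeting at noon", "lunch tomorrow"]

p, q = estimate_rates(spam_messages, ham_messages, "viagra")
print(p, q)
```

The more labeled messages the user supplies, the closer these fractions should come to the true rates.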
Example • Suppose the filter has the following data • Threshold Probability: .9 • “Viagra” occurs in 250 of 2000 spam messages • “Viagra” occurs in only 5 of 1000 non-spam messages • Let’s try to estimate the probability, using the process we just defined
Example Cont. • Step 1: Estimate the probability that a spam message contains the word “Viagra”. • p(Viagra) = 250 / 2000 = 0.125 • Step 2: Estimate the probability that a non-spam message contains the word “Viagra”. • q(Viagra) = 5 / 1000 = 0.005
Example Cont. • Since we are assuming that it is equally likely that an incoming message is or is not spam, we can estimate the probability with this equation: • r(Viagra) = p(Viagra) / (p(Viagra) + q(Viagra))
Example Cont. • r(Viagra) = 0.125 / (0.125 + 0.005) = 0.125 / 0.130 ≈ 0.962 • Since r(Viagra) is greater than the threshold of 0.9, we can reject this message as spam.
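The single-word calculation can be checked with a few lines of Python. The function name `single_word_spam_probability` is made up for this sketch, and it assumes equal prior probabilities for spam and non-spam, as the example does.

```python
def single_word_spam_probability(
    spam_hits: int, spam_total: int, ham_hits: int, ham_total: int
) -> float:
    """Estimate r(w) = p(w) / (p(w) + q(w)), assuming P(S) = P(S^C).

    Assumes the word occurs in at least one training message,
    so the denominator is non-zero.
    """
    p = spam_hits / spam_total  # fraction of spam messages containing w
    q = ham_hits / ham_total    # fraction of non-spam messages containing w
    return p / (p + q)


THRESHOLD = 0.9

r = single_word_spam_probability(250, 2000, 5, 1000)
print(round(r, 3))    # 0.962
print(r > THRESHOLD)  # True
```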
Harder Stuff • Single-word detection can lead to a lot of false positives and false negatives. • To counter this, most spam filters look for the presence of multiple words.
Another Example • 2000 Spam messages; 1000 real messages • “Viagra” appears in 400 spam messages • “Viagra” appears in 60 real messages • “Cialis” appears in 200 spam and 25 real messages • Threshold Probability: .9 • Let’s calculate the probability that it’s spam.
Example Cont. • Step 1: Estimate the probability that a spam message contains the word “Viagra”. • p(Viagra) = 400 / 2000 = 0.2 • Step 2: Estimate the probability that a non-spam message contains the word “Viagra”. • q(Viagra) = 60 / 1000 = 0.06
Example Cont. • Step 3: Estimate the probability that a spam message contains the word “Cialis”. • p(Cialis) = 200 / 2000 = 0.1 • Step 4: Estimate the probability that a non-spam message contains the word “Cialis”. • q(Cialis) = 25 / 1000 = 0.025
Example Cont. • Using our approximation, we have: • r(Viagra, Cialis) = p(Viagra) * p(Cialis) / (p(Viagra) * p(Cialis) + q(Viagra) * q(Cialis))
Example Cont. • r(Viagra, Cialis) = (0.2)(0.1) / ((0.2)(0.1) + (0.06)(0.025)) = 0.02 / 0.0215 ≈ 0.930 • Since this exceeds the threshold probability of 0.9, the message is rejected as spam.
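The two-word calculation generalizes to any number of words. The sketch below is one way to write it, using the values from Steps 1 to 4 (q(Viagra) = 60 / 1000 = 0.06); the function name `multi_word_spam_probability` is hypothetical, and the code relies on the same independence and equal-prior assumptions as the slides.

```python
from math import prod
from typing import Sequence


def multi_word_spam_probability(
    p_values: Sequence[float], q_values: Sequence[float]
) -> float:
    """Estimate r(w1, ..., wn) from per-word rates.

    p_values[i] is the fraction of spam messages containing word i,
    q_values[i] the fraction of non-spam messages containing it.
    Assumes independence of the word events given spam / non-spam,
    P(S) = P(S^C), and a non-zero denominator.
    """
    spam_term = prod(p_values)
    ham_term = prod(q_values)
    return spam_term / (spam_term + ham_term)


THRESHOLD = 0.9

# p(Viagra), p(Cialis) and q(Viagra), q(Cialis) from the example
r = multi_word_spam_probability([0.2, 0.1], [0.06, 0.025])
print(f"{r:.3f}")     # 0.930
print(r > THRESHOLD)  # True
```

Products of many small probabilities can underflow, so an implementation that scores many words would likely work with sums of logarithms instead.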