Spam Filtering: An Artificial Intelligence Showcase • Presented by: Alex Misstear
What is Spam • Messages sent indiscriminately to a large number of recipients • We all hate it • Term attributed to a Monty Python skit • Legitimate messages sometimes referred to as “ham”.
History of Spam • First recorded case in 1978 • An ad created by Digital Equipment Corporation • Sent to a few hundred recipients over ARPANET • Drew instant negative feedback but did result in some sales • Term first used in 1993 to describe an accidental post to a USENET newsgroup caused by a software bug • Considered humorous at the time • First major use as a business practice in 1994
Spam Email Everywhere • Estimated spam share of email (Symantec): • January 2013: 64.1% • December 2012: 70.6% • July 2012: 67.6% • January 2012: 69.0% • At times these figures can exceed 80%
Filtering Techniques • Rule-based • Prone to false positives • E.g., the word “mortgage” appears in a lot of spam but also in some very important ham. • Checksum filtering • Easily circumvented by senders • Inserting random characters changes the hash • Blacklisting/whitelisting • Prone to complications for the recipient • Bayesian filtering • Low false positives • Many more…
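To illustrate why checksum filtering is easy to circumvent, here is a minimal sketch. It assumes the filter stores a SHA-256 digest of each known spam message; the slides do not say which hash is used, and the function names and sample message are hypothetical.

```python
import hashlib
import random
import string


def checksum(message: str) -> str:
    """Return the hex digest a checksum filter might store (SHA-256 is an assumption)."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def add_random_suffix(message: str, rng: random.Random, length: int = 6) -> str:
    """Mimic a sender appending a few random characters to each copy of a message."""
    suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{message} {suffix}"


known_spam = "Refinance your mortgage today at unbeatable rates!"
blocklist = {checksum(known_spam)}  # digests of messages already reported as spam

rng = random.Random(0)  # fixed seed so the example is repeatable
variant = add_random_suffix(known_spam, rng)

exact_copy_blocked = checksum(known_spam) in blocklist
variant_blocked = checksum(variant) in blocklist
print(exact_copy_blocked, variant_blocked)
```

The exact copy matches the stored digest, while the padded variant hashes to a different value and is not caught.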
Bayesian Spam Filtering • Particular chunks of text occur often in spam but seldom in ham messages. • First introduced in 1996 • Improved upon by Paul Graham in 2002 • Not just a simple text classification problem: messages contain obfuscated characters and HTML content. • Leetspeak: v1agra • IP addresses: (127.0.0.1) • Empty HTML comments: <!-- -->
Concept • Based on the idea that the probability of a message being spam is related to how often the words in the message have previously occurred in spam and in ham. • Each word can be used to help calculate this probability. • Maintain a database mapping words to probabilities • Probability the word appears in spam • Probability the word appears in ham
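The word database described above could be built roughly as follows. This is a sketch under stated assumptions: the tokenizer, the function names, and the sample messages are hypothetical, each word is counted at most once per message, and both training sets are assumed to be non-empty.

```python
import re
from collections import Counter


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return the distinct word-like tokens it contains."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def build_word_table(spam_msgs: list[str], ham_msgs: list[str]) -> dict[str, tuple[float, float]]:
    """Map each word to (P(word | spam), P(word | ham)) using per-message counts."""
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))
    table = {}
    for word in spam_counts.keys() | ham_counts.keys():
        table[word] = (
            spam_counts[word] / len(spam_msgs),  # fraction of spam containing the word
            ham_counts[word] / len(ham_msgs),    # fraction of ham containing the word
        )
    return table


# Tiny hand-made training sets, for illustration only.
spam = ["Refinance now", "Cheap refinance offer", "Win a prize now"]
ham = ["Meeting moved to noon", "Can you refinance the loan schedule?"]

table = build_word_table(spam, ham)
print(table["refinance"])
```

A real filter would train on far more messages and would usually smooth or clamp the probabilities so that rare words do not end up at exactly 0 or 1.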
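Biased & Unbiased Filtering • Biased filters are adjusted based on reports and may assign P(S) = 0.8 and P(H) = 0.2 • Most spam filters take an unbiased approach and consider all messages to have an equal probability of being spam or ham, i.e. P(S) = P(H) = 0.5. • Therefore, Bayes' theorem for a word W, $P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid H)\,P(H)}$, is shortened to $P(S \mid W) = \frac{P(W \mid S)}{P(W \mid S) + P(W \mid H)}$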
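Example • The word under investigation: refinance • Appears in 5/500 ham messages, so $P(W \mid H) = 0.01$ • Appears in 400/5000 spam messages, so $P(W \mid S) = 0.08$ • Assuming equal priors, $P(S \mid W) = \frac{0.08}{0.08 + 0.01} \approx 0.89$ • This probability is referred to as the word's “spamicity”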
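The arithmetic for the refinance example can be checked with a few lines of Python. This is a minimal sketch: the function name `spamicity` and its parameters are illustrative, and the biased case reuses the P(S) = 0.8 prior mentioned on the earlier slide.

```python
def spamicity(p_word_given_spam: float, p_word_given_ham: float, p_spam: float = 0.5) -> float:
    """P(spam | word) from Bayes' theorem; p_spam=0.5 is the unbiased case."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    return numerator / denominator


p_w_spam = 400 / 5000  # word appears in 400 of 5000 spam messages
p_w_ham = 5 / 500      # word appears in 5 of 500 ham messages

unbiased = spamicity(p_w_spam, p_w_ham)             # equal priors
biased = spamicity(p_w_spam, p_w_ham, p_spam=0.8)   # prior tilted toward spam
print(round(unbiased, 3), round(biased, 3))  # prints: 0.889 0.97
```

With equal priors the 0.5 factors cancel, which is why the unbiased result equals 0.08 / (0.08 + 0.01).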
Applying Bayes' Theorem • Break down messages into words as they arrive. • Single out the most interesting/relevant words (in Paul Graham's approach, those whose spam probability in the database is furthest from a neutral 0.5). • Generate the spamicity for each. • Combine all the spamicities. • If the overall spamicity is greater than a certain threshold, the message is marked as spam.
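The first steps above (splitting a message into words and singling out the most interesting ones) can be sketched as follows. Here "interesting" is taken to mean furthest from a neutral 0.5, which is an interpretation based on Paul Graham's essay. The spamicity table, the value used for unknown words, and the limit of three words are all hypothetical.

```python
import re

# Hypothetical per-word spamicities, as a trained filter might store them.
WORD_SPAMICITY = {"refinance": 0.89, "free": 0.91, "meeting": 0.05, "the": 0.50, "rates": 0.70}
UNKNOWN_WORD_SPAMICITY = 0.4  # assumed value for words not yet in the database


def most_interesting(message: str, limit: int = 3) -> list[tuple[str, float]]:
    """Return up to `limit` distinct words whose spamicity is furthest from 0.5."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    scored = [(w, WORD_SPAMICITY.get(w, UNKNOWN_WORD_SPAMICITY)) for w in words]
    # Sort by distance from neutral, then alphabetically so ties are deterministic.
    scored.sort(key=lambda ws: (-abs(ws[1] - 0.5), ws[0]))
    return scored[:limit]


picked = most_interesting("Refinance the meeting rates for free")
print(picked)  # prints: [('meeting', 0.05), ('free', 0.91), ('refinance', 0.89)]
```

Note that a strongly ham-like word such as "meeting" counts as interesting too, since it is strong evidence against the message being spam.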
Combining Probabilities • Given the set of singled-out spamicities $p_1, \dots, p_n$, calculate the overall probability of spam: $p = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}$ • Naïve Bayesian classification • All words/probabilities are considered independent of one another • Email is not a straight text classification problem
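A minimal sketch of the combination step, using the standard naive Bayes combining formula p = p1·…·pn / (p1·…·pn + (1 − p1)·…·(1 − pn)). The function names and the 0.9 threshold are assumptions for illustration, and every spamicity is assumed to lie strictly between 0 and 1 so that the denominator cannot be zero.

```python
import math


def combine(spamicities: list[float]) -> float:
    """Combine per-word spamicities under the naive independence assumption."""
    prod_spam = math.prod(spamicities)
    prod_ham = math.prod(1.0 - p for p in spamicities)
    return prod_spam / (prod_spam + prod_ham)


def is_spam(spamicities: list[float], threshold: float = 0.9) -> bool:
    """Mark the message as spam when the combined score exceeds the threshold."""
    return combine(spamicities) > threshold


# Two spam-like words and one strongly ham-like word (hypothetical values).
score = combine([0.89, 0.91, 0.05])
print(round(score, 3), is_spam([0.89, 0.91, 0.05]))  # prints: 0.812 False
```

For long lists of spamicities, implementations often sum logarithms instead of multiplying directly, to avoid floating-point underflow.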
Results • Statistics vary with the individual and the messages received • Spam detection rates of 99.7% are common • False-positive rate of 0.03% • Calculating spamicity for phrases rather than single words has been shown to improve these numbers slightly • Requires an initial learning period with ham/spam classification feedback to build the database • Typically a couple of weeks
Bayesian Poisoning • Spammers send messages padded with random, seemingly legitimate words to degrade the filter • Later spam messages may then get through • Can also increase the false positive rate • Difficult for the attacker to tune the attack if no feedback is given about which messages get through (withholding feedback is critical to protection) • Can be prevented with periodic retraining
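The effect of poisoning on a single word can be sketched with hypothetical counts. The example assumes an unbiased filter, a legitimate word ("meeting") that was rare in spam before the attack, and 1000 padded spam messages that all contain the word and are all reported as spam. None of these numbers come from the slides.

```python
def spamicity(spam_hits: int, spam_total: int, ham_hits: int, ham_total: int) -> float:
    """Unbiased spamicity P(S | W) computed from message counts."""
    p_w_spam = spam_hits / spam_total
    p_w_ham = ham_hits / ham_total
    return p_w_spam / (p_w_spam + p_w_ham)


# Before the attack: "meeting" appears in 10 of 5000 spam and 100 of 500 ham messages.
before = spamicity(10, 5000, 100, 500)

# After the attack: 1000 extra spam messages containing "meeting" enter the training data.
after = spamicity(10 + 1000, 5000 + 1000, 100, 500)

print(round(before, 3), round(after, 3))  # prints: 0.01 0.457
```

The word drifts from strongly ham-like toward neutral, so legitimate messages that rely on it lose evidence in their favor, which is one way the false positive rate can rise.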
Conclusion • Bayesian filtering is considered the best of these techniques • Adaptive solution • Can look at more than just the message body • Inherently multilingual • Individuals/corporations can have their own filter, which learns from their message behavior • Difficult for attackers to circumvent • Requires an initial learning period
References • http://www.paulgraham.com/spam.html • http://www.paulgraham.com/better.html • http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf • http://en.wikipedia.org/wiki/Bayesian_spam_filtering • http://www.symantec.com/theme.jsp?themeid=state_of_spam • http://en.wikipedia.org/wiki/Anti-spam_techniques • ftp://ftp.research.microsoft.com/users/joshuago/papers-2005/125.pdf