Email Spam Filtering
Computer Security Seminar
N. Muthiyalu Jothir – 271120
Media Informatics
Email Spam Filtering - Muthiyalu Jothir
Agenda
• What is Spam?
• Statistics
• Who Benefits from It?
• Spam Filtering Techniques
• Combining Filters
• Conclusion
What is Spam?
• Spam: unsolicited email
• Identical or nearly identical messages sent to thousands (or millions) of recipients.
• Caution! “SPAM” (spiced ham) is also a popular American canned meat brand…
Problem
• With a tiny investment, a spammer can send over 100,000 bulk emails per hour.
• Junk mail wastes storage and transmission bandwidth.
• ISPs' investment: a cost we absorb as the ISPs' customers
• Spam is a problem because the cost is forced onto us, the recipients.
Statistics
Who benefits from Spam?
• Financial firms (e.g. mortgage): receive information about interested customers
• Lead generators: the recipient replies here; they gain 2% of the loan value per customer record
• Spammers: share the profit with the lead generators
• Recipient
Spam Control Techniques
Fight-back techniques
• Reporting spam to ISPs
• Fight-back filters
• Slow senders
• Law ???
• etc.
Filtering techniques
• Challenge-response filtering
• Blacklists and whitelists
• Content-based filters
  • Rule-based
  • Bayesian filters
Reporting Spam To ISPs
• The original spam solution
• Legitimate ISPs respond to such complaints
• Spammers get kicked off
Disadvantages
• Spammers disguise themselves.
• Naïve users cannot interpret the email headers
Filters that Fight Back (FFB)
• The majority of spam contains links to web pages.
• Spam filters could automatically retrieve those URLs and crawl the linked pages, which would increase the load on the spammer's server.
• If all spam recipients do this at the same time, the server might crash, so the cost of spamming increases.
Caution!
• FFB usually works with blacklists (of malicious servers) to avoid attacking innocent servers.
Filtering Techniques
Spam vs Ham
• Care must be taken in any spam filtering technique
• “All the spam could be allowed to pass through, but not even a single legitimate mail should be filtered.”
• False positive: legitimate mail classified as spam.
• The lowest possible false positive rate is desired…
• Caution: check your junk folder before deleting
• Don't trust your spam filter blindly
Challenge-Response Filtering
• Emails from unknown senders receive an auto-reply message asking the sender to verify themselves
• Senders are “challenged” to type in a word that is hidden within a graphic or a sound file
• The mail is forwarded to the receiver's inbox only after a successful “response”
• This technique filters almost all spam. No spammer would be interested in taking the extra effort to prove themselves.
• Commercial product: “spamarrest”
Disadvantages
• This technique is rude
• Sometimes senders don't reply, or forget to reply, to the challenge
Blacklists and Whitelists
• Blacklists of misbehaving servers or known spammers are collected by several sites.
• The sender ID in the email is compared with the blacklist
• Whitelists are complementary to blacklists and contain addresses of trusted contacts
• Use blacklists and whitelists for first-level filtering (before applying content checks), not as the only tool for making the decision.
Disadvantage
• Prone to misconfiguration: a legitimate server that was incorrectly inserted into a list may be unable to get off it.
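A minimal sketch of such a first-level check, assuming the lists are plain in-memory sets. The addresses, the domain list, and the function name `first_level_check` are hypothetical and only illustrate the idea; anything the lists cannot decide is passed on to the content checks.

```python
from email.utils import parseaddr

# Hypothetical example lists; a real deployment would load maintained lists.
WHITELIST = {"alice@example.com"}
BLACKLIST = {"bulk@spam.example"}
BLACKLISTED_DOMAINS = {"spam.example"}


def first_level_check(from_header):
    """Return 'accept', 'reject', or 'unknown' for the sender in a From header."""
    _, address = parseaddr(from_header)
    address = address.lower()
    domain = address.rpartition("@")[2]
    if address in WHITELIST:
        return "accept"
    if address in BLACKLIST or domain in BLACKLISTED_DOMAINS:
        return "reject"
    # Neither list knows the sender: defer to the content-based filter.
    return "unknown"
```

Checking the whitelist first means a trusted contact is never rejected because of a broad domain entry, which keeps false positives down.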
Content-Based Filters
• It is not a good idea to filter mail based on blacklists alone
• Wiser decision: consider the actual content of the email
• Almost all successful spam filters use this technique
• Major types: rule-based and Bayesian
Rule-Based Filters
• Rule-based filters apply static rules to decide whether a mail is spam or not.
• Rules could be
  • words and phrases
  • lots of uppercase characters
  • exclamation points
  • special characters
  • web links
  • HTML messages
  • background colors
  • crazy subject lines, etc.
Rule-Based Filters
• Rules are given scores, based on importance
• Incoming mails are parsed and checked for known malicious patterns
• A total score is calculated from the triggered rules
• If the final score > threshold, classify as spam. Otherwise, classify as legitimate mail.
• The threshold is decided by the user.
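The scoring scheme above can be sketched as follows. This is only an illustration: the rule names, patterns, scores, and the default threshold of 3.0 are assumptions made up for the example, not values from any real filter.

```python
import re

# Hypothetical rules: (name, compiled pattern, score). Scores reflect importance.
RULES = [
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("LONG_UPPERCASE_RUN", re.compile(r"\b[A-Z]{8,}\b"), 1.0),
    ("CONTAINS_LINK", re.compile(r"https?://", re.IGNORECASE), 0.5),
    ("SPAMMY_PHRASE", re.compile(r"\bfree money\b", re.IGNORECASE), 2.0),
]


def score_message(text):
    """Return the total score and the names of the rules that triggered."""
    triggered = [(name, score) for name, pattern, score in RULES
                 if pattern.search(text)]
    total = sum(score for _, score in triggered)
    return total, [name for name, _ in triggered]


def classify(text, threshold=3.0):
    """Classify as spam when the total score exceeds the user-chosen threshold."""
    total, _ = score_message(text)
    return "spam" if total > threshold else "ham"
```

For example, `classify("FREE MONEY!!! Visit http://example.test now")` triggers three of the four rules, and a user who wants fewer false positives can simply raise `threshold`.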
Rule-Based Filters
• “SpamAssassin”, a popular spam filtering product, uses rule-based filtering.
• Perl regular expressions (regex) are used for pattern checking
• Example rules
  • header __LOCAL_FROM_NEWS From =~ /news@example\.com/i
  • body __LOCAL_SALES_FIGURES /\bMonthly Sales Figures\b/
  • meta LOCAL_NEWS_SALES_FIGURES (__LOCAL_FROM_NEWS && __LOCAL_SALES_FIGURES)
  • score LOCAL_NEWS_SALES_FIGURES 0.8
Rule-Based Filters
• Advantages
  • Easy to implement
  • No training required
• Disadvantages
  • Static rules are too general
  • Spammers find new ways to deceive the rules
Bayesian Filters
• Bayesian filters are the latest in spam filtering technology and the most successful.
• Bayes classifiers have been used extensively in the field of pattern recognition.
• Given an unlabeled example, the classifier calculates the most likely classification, together with its probability.
Bayesian Filters
• Steps in Bayes filtering
  • Training
  • Validation
  • Implementation
• Training starts with two collections of mail: one of spam and one of legitimate mail.
• For every word in these emails, the filter calculates a spam probability based on the proportion of spam occurrences.
• Bayesian filters are quite accurate, and adapt automatically as spam evolves.
• Bayesian filtering minimizes false positives because it considers evidence of innocence as well as evidence of spam.
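The training step can be sketched roughly as below. The slides do not give the exact formula, so this assumes one common choice: a word's spam probability is its spam rate divided by the sum of its spam and ham rates, where each rate is the fraction of mails in that collection containing the word. The function names and the tokenizer are hypothetical, and both collections are assumed to be non-empty.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the mail and return its distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def train(spam_mails, ham_mails):
    """Return {word: spam probability} learned from the two mail collections."""
    spam_counts = Counter(w for mail in spam_mails for w in tokenize(mail))
    ham_counts = Counter(w for mail in ham_mails for w in tokenize(mail))
    probabilities = {}
    for word in spam_counts.keys() | ham_counts.keys():
        spam_rate = spam_counts[word] / len(spam_mails)
        ham_rate = ham_counts[word] / len(ham_mails)
        # Every word here occurs in at least one mail, so the sum is non-zero.
        probabilities[word] = spam_rate / (spam_rate + ham_rate)
    return probabilities
```

A word seen only in spam gets probability 1.0 and a word seen only in legitimate mail gets 0.0, which is how the filter collects evidence of innocence as well as evidence of spam. A practical filter would likely smooth these extremes.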
Bayesian Filtering
• Bayes probability: $\Pr(\text{spam} \mid \text{words}) = \dfrac{\Pr(\text{spam}) \cdot \Pr(\text{words} \mid \text{spam})}{\Pr(\text{words})}$
• A probability closer to 1 is classified as spam; one closer to 0 is classified as ham.
• 0.5 is set as the threshold.
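A small sketch of how this rule might be evaluated in practice. It assumes the words are conditionally independent (the "naive" Bayes assumption, which the slides do not state), expands Pr(words) as Pr(spam)·Pr(words | spam) + Pr(ham)·Pr(words | ham), and works in log space to avoid underflow. The likelihood tables, the `unseen` fallback value, and the function names are hypothetical; `p_spam` must lie strictly between 0 and 1.

```python
import math


def spam_posterior(words, p_word_given_spam, p_word_given_ham,
                   p_spam=0.5, unseen=1e-3):
    """Return Pr(spam | words) under the naive independence assumption."""
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in words:
        log_spam += math.log(p_word_given_spam.get(word, unseen))
        log_ham += math.log(p_word_given_ham.get(word, unseen))
    # Normalise by Pr(words); subtracting the max keeps exp() well-behaved.
    largest = max(log_spam, log_ham)
    spam_term = math.exp(log_spam - largest)
    ham_term = math.exp(log_ham - largest)
    return spam_term / (spam_term + ham_term)


def classify(words, p_word_given_spam, p_word_given_ham, threshold=0.5):
    """Apply the 0.5 threshold from the slide to the posterior."""
    posterior = spam_posterior(words, p_word_given_spam, p_word_given_ham)
    return "spam" if posterior > threshold else "ham"
```

With a likelihood of 0.8 for "cheap" in spam and 0.1 in ham, and equal priors, the posterior for a mail containing only "cheap" is 0.8 / 0.9, so it lands above the threshold.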
Neural Network for Training
[Figure: neural network structure]
Neural Networks for Training
• A neural network is used to train the spam filter (rule-based or Bayesian); it is not itself a filter
• Input: words, rules, etc.
• Trained over multiple samples of the user's mails (both spam and ham)
• The weights of the links are altered until the desired output is obtained.
Supervised Learning
• Supervised learning: training with a “teacher” signal
• Train the system until the weights of the edges stop changing.
Caution!
• Take care not to overtrain the network.
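As an illustration of training with a teacher signal, here is a sketch of a single-layer perceptron, which is an assumption chosen for brevity and not necessarily the network used in the talk. The labels act as the teacher, training stops once an entire pass leaves the weights unaltered, and `max_epochs` is a crude cap on training time (a held-out validation set would be the more usual guard against overtraining).

```python
def train_perceptron(samples, labels, learning_rate=0.1, max_epochs=100):
    """Adjust weights until every sample yields its teacher label (1 = spam)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        changed = False
        for features, target in zip(samples, labels):
            activation = bias + sum(w * x for w, x in zip(weights, features))
            output = 1 if activation > 0 else 0
            error = target - output  # teacher signal minus actual output
            if error != 0:
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
                changed = True
        if not changed:  # weights unaltered for a full pass: stop training
            break
    return weights, bias


def predict(weights, bias, features):
    """Return 1 (spam) or 0 (ham) for one feature vector."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0
```

The features could be, for instance, binary indicators such as "contains a link" and "contains the word free"; the perceptron only converges when the two classes are linearly separable in those features.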
Combining Spam Filters
• Goal: the combined filter aims to improve on the individual filters' performance.
• Combined Filter = Original Filter (OF) + Received Filter (RF)
• Maximum gain: when the received filter contains feature sets not found in the original filter.
• E.g. Original Filter = {“Share Market”, “Higher Studies”}, Received Filter = {“Share Market”, “Job Alerts”}
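Treating each filter as a plain set of features (an assumption made only for this illustration), the gain in the example is the set difference, and the combined feature set is the union:

```python
# Feature sets from the example on the slide.
original_filter = {"Share Market", "Higher Studies"}
received_filter = {"Share Market", "Job Alerts"}

# Features the received filter contributes beyond the original one.
gained = received_filter - original_filter

# Feature set covered by the combined filter.
combined = original_filter | received_filter

print(sorted(gained))
print(sorted(combined))
```

Here the received filter adds one feature the original lacks, while the shared feature brings no gain.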
Challenges
• Decisions (spam / ham) are made by both filters individually
• Decisions agree: no problem
• Disagreement: due to the difference in feature sets
• Challenges
  • “How do we select the correct decision or filter?”
  • “Who selects it?”
Filter Selector (FS)
• Training phase: FS predicts the unique features (e.g. words) of RF
• Parse the emails of the training set and extract the features
• Build a ‘bag’ of (predicted) features for RF
• Compare text similarity between the current e-mail's features and the feature sets of the filters.
Algorithm Flowchart
[Figure: flowchart; recoverable labels: Training Phase, Final Verdict]
TF-IDF Similarity Measure
• Commonly used in information retrieval applications.
• More frequent words would be key to accurate classification of emails
• The feature set predicted by FS is unique
• Follows the “query-document” retrieval procedure.
  • 2 documents: the feature sets
  • Query: the current email
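A rough sketch of the query-document comparison, with the two feature sets as documents and the current email as the query. The slides do not specify the weighting or the similarity function, so this assumes raw term counts, idf = log(N / df), and cosine similarity; the function names and the filter labels "OF" and "RF" in the usage are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight each known token by its count times its inverse document frequency."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_filter(email_tokens, filter_features):
    """Return the filter whose feature set is most similar to the email."""
    n_docs = len(filter_features)
    df = Counter(t for feats in filter_features.values() for t in set(feats))
    idf = {t: math.log(n_docs / d) for t, d in df.items()}
    query = tfidf_vector(email_tokens, idf)
    scores = {name: cosine(query, tfidf_vector(feats, idf))
              for name, feats in filter_features.items()}
    # On a tie, max() keeps the first filter listed.
    return max(scores, key=scores.get), scores
```

With only two documents, a feature shared by both filters gets idf = log(2/2) = 0, so only the features unique to one filter influence the choice. For an email about "job alerts", the received filter from the earlier example would be selected.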
Experiments & Results
Conclusion
• We discussed the techniques to “kill” spam
• Compared the various techniques
• So far, Bayesian filtering seems to be reliable
• Discussed a new approach to combine filters
• Future work:
  • Learning techniques for the Filter Selector
  • Better similarity measures
Thank You