This paper presents an innovative spam detection algorithm leveraging structural similarities among senders and recipients to reduce false positives. By clustering based on contact lists and historical data, it enhances existing techniques and minimizes misclassifications.
Improving Spam Detection Based on Structural Similarity
Authors: Luiz Henrique Gomes, Fernando Castro, Rodrigo Almeida, Luis Bettencourt, Virgílio Almeida
Published in: Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005
Presenter: Danzhou Liu
Introduction
• Volume of spam traffic is increasing sharply
  • 83% of all incoming e-mails in 2005 vs. 24% in January 2003
• Current detection techniques are not fully successful
  • Spammers evade detection by frequently changing the e-mail characteristics traditionally used for detection/filtering
    • E-mail content and subjects
• False positives: legitimate e-mails are misclassified as spam
  • High cost to end users
• Goal: improve spam detection by reducing the number of false positives
CDA6938
The Proposed Algorithm
• Assumption: contact lists change less frequently than other e-mail characteristics
  • The set of recipients targeted by a sender is likely to remain stable for longer than e-mail content and subjects
• Exploit structural relationships between senders and recipients
• Senders and recipients are clustered based on the similarity of their contact lists
• Historical information is used
Modeling Similarity Among Senders and Recipients
• Vectorial representation of an e-mail sender
[Figure: example with senders S0, S1, S2, S3 and recipients r0, r1, r2]
Modeling Similarity Among Senders and Recipients
• Vectorial representation of an e-mail sender
• Vectorial representation of a sender cluster
• Similarity between a sender and a sender cluster
Note: similar representations are used for recipients
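The formulas for these three definitions are not present in the slide text. As a minimal sketch of one plausible reading, the code below assumes a sender is a binary indicator vector over recipients, a cluster vector is the sum of its members' vectors, and similarity is the cosine of the angle between the two. All three choices, and the function names, are assumptions made for illustration rather than definitions taken from the slides.

```python
import math
from collections import Counter


def sender_vector(recipients):
    """Assumed representation: sparse binary vector, recipient -> 1."""
    return Counter(set(recipients))


def cluster_vector(member_vectors):
    """Assumed representation: element-wise sum of the members' vectors."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)  # Counter.update adds counts
    return total


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, a sender who wrote to r0 and r1 and a sender who wrote to r1 and r2 share one of two contacts each, so `cosine(sender_vector(["r0", "r1"]), sender_vector(["r1", "r2"]))` comes out at about 0.5. The same functions apply to recipients by swapping roles, with each recipient indexed by the senders who wrote to it.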
The Proposed Algorithm
For any incoming e-mail (From: sender, To: recipients):
1. Classify the incoming e-mail with the auxiliary spam detection method
2. Find the similarity between the sender and each sender cluster
3. Add the sender to the cluster that is most similar to it, as long as the similarity > τ
4. Calculate PS, the spam probability of the sender’s cluster (from the auxiliary classification of this e-mail and of previous e-mails sent by the cluster)
5. Repeat the process with the recipients and calculate PR (the average spam probability of all the recipients’ clusters)
6. Finally, determine whether the incoming e-mail is spam based on PS and PR
[Figure: example in which the sender has sim = 0.9, 0.5, 0.2, and 0.3 against the sender clusters, and PS = 0.8]
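Steps 1 to 5 can be sketched as follows. This is an illustrative sketch under several assumptions that the slides do not state: similarity is cosine similarity over binary contact vectors, a sender or recipient that matches no cluster above τ starts a new cluster, a cluster's spam probability is the fraction of its e-mails that the auxiliary method flagged, and members are never moved between clusters. The names `Cluster`, `StructuralFilter`, and `observe` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class Cluster:
    def __init__(self):
        self.members = set()
        self.vector = Counter()  # assumed: sum of members' contact vectors
        self.spam = 0            # e-mails the auxiliary method flagged as spam
        self.total = 0           # all e-mails recorded for this cluster

    def record(self, aux_is_spam):
        self.total += 1
        self.spam += int(aux_is_spam)

    def spam_probability(self):
        return self.spam / self.total if self.total else 0.0


class StructuralFilter:
    def __init__(self, tau=0.5):
        self.tau = tau
        self.sender_clusters = []
        self.recipient_clusters = []
        self.sent_to = {}        # sender -> recipients seen so far
        self.received_from = {}  # recipient -> senders seen so far

    def _assign(self, clusters, name, contacts):
        """Steps 2-3: join the most similar cluster if sim > tau."""
        vec = Counter(set(contacts))
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(cluster.vector, vec)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim <= self.tau:
            best = Cluster()     # assumption: no match starts a new cluster
            clusters.append(best)
        if name not in best.members:
            best.members.add(name)
            best.vector.update(vec)
        return best

    def observe(self, sender, recipients, aux_is_spam):
        """Return (PS, PR) given the auxiliary verdict from step 1."""
        self.sent_to.setdefault(sender, set()).update(recipients)
        s_cluster = self._assign(self.sender_clusters, sender,
                                 self.sent_to[sender])
        s_cluster.record(aux_is_spam)
        ps = s_cluster.spam_probability()            # step 4

        r_clusters = []                              # step 5
        for r in recipients:
            self.received_from.setdefault(r, set()).add(sender)
            c = self._assign(self.recipient_clusters, r,
                             self.received_from[r])
            if c not in r_clusters:
                r_clusters.append(c)
        for c in r_clusters:
            c.record(aux_is_spam)                    # once per distinct cluster
        pr = (sum(c.spam_probability() for c in r_clusters) / len(r_clusters)
              if r_clusters else 0.0)
        return ps, pr
```

A caller would create one `StructuralFilter`, run the auxiliary detector on each e-mail, and pass its verdict to `observe`. The returned pair (PS, PR) then feeds the final decision in step 6.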
[Figure: the same example extended to recipients, with sender clusters giving PS = 0.8 and recipient clusters giving PR = 0.5]
Classification
• Classify the e-mail as spam if the point (PS, PR) falls in the blue area (labeled Spam)
• Classify the e-mail as legitimate if the point (PS, PR) falls in the green area (labeled Legitimate)
• Implemented by computing the Spam Rank
[Figure: plane with axes PS(m) and PR(m), divided into Spam and Legitimate regions, with the example point PS = 0.8, PR = 0.5]
Spam Rank Computation
• The Spam Rank vector is (PS(m), PR(m))
• The Spam Rank (SR) is the norm of the projection of this vector onto the diagonal
• If SR > ω: classify the e-mail as spam
• Else if SR < 1 - ω: classify it as legitimate
• Otherwise, use the classification reported by the auxiliary algorithm
[Figure: plane with axes PS(m) and PR(m), thresholds ω and 1 - ω marked along the diagonal, and SR shown for the example point PS = 0.8, PR = 0.5]
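The decision rule can be sketched in a few lines. One detail is an assumption here: because SR is compared against ω and 1 - ω, the sketch divides the projection's norm by the length of the unit square's diagonal, √2, so that SR lies in [0, 1]. The slides do not spell out this normalization, and the function names are hypothetical.

```python
import math


def spam_rank(ps, pr):
    """Project (ps, pr) onto the diagonal and scale the result to [0, 1]."""
    projection = (ps + pr) / math.sqrt(2)  # norm of the projection onto the diagonal
    return projection / math.sqrt(2)       # assumed normalization by the diagonal's length


def classify(ps, pr, aux_is_spam, omega=0.85):
    """Apply the three-way rule; fall back to the auxiliary verdict in between."""
    sr = spam_rank(ps, pr)
    if sr > omega:
        return "spam"
    if sr < 1 - omega:
        return "legitimate"
    return "spam" if aux_is_spam else "legitimate"
```

Under this normalization, the example point PS = 0.8, PR = 0.5 has SR of about 0.65. With ω = 0.85 that falls between the two thresholds, so the auxiliary classification decides.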
Experiments
• Experimental data set
• Auxiliary spam detection method
  • SpamAssassin
Selecting the Similarity Threshold τ
• The number of sender/recipient clusters is roughly stable for τ ≥ 0.5; therefore τ = 0.5 is used in the experiments
Effectiveness of Spam Rank
[Figure: (a) bin size = 0.25; (b) bin size = 0.10]
• Clusters with high PS/PR send/receive a large number of spam e-mails
• Clusters with low PS/PR send/receive a large number of legitimate e-mails
E-mail Classification
• A higher ω means that fewer e-mails can be classified by Spam Rank
• Tradeoff between the total number of e-mails that are classified and the agreement with the classification provided by the original (auxiliary) classifier
Accuracy of Spam Detection
• τ = 0.5, ω = 0.85:
• 879 e-mails were manually analyzed for possible false positives
Strengths of the Paper
• New spam detection algorithm that exploits structural similarities of senders and recipients
• Clusters senders/recipients based on contact lists
• Using the historical information of each cluster can improve the accuracy of existing detection algorithms
• Reduces the number of false positives caused by SpamAssassin
Weaknesses of the Paper
• Runs slowly: high-dimensional vectors lead to computational overhead
• The assumption may not always hold
• Cannot handle forged e-mail addresses
• The number of false negatives may be high
Improvements of the Paper
• Dimension reduction for clustering
• Majority voting that combines several spam detection techniques to reduce both false positives and false negatives
• Consider the spam probability of a sender-recipient pair
• Further evaluation with logs covering a longer period (more than eight days)
Q&A