Improving Spam Detection Based on Structural Similarity

Authors: Luiz Henrique Gomes, Fernando Castro, Rodrigo Almeida, Luis Bettencourt, Virgílio Almeida
Published in: Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005
Presenter: Danzhou Liu

Introduction
• The volume of spam traffic is increasing sharply: 83% of all incoming e-mails in 2005 vs. 24% in January 2003
• Current detection techniques are not fully successful
• Spammers evade detection by frequently changing the e-mail characteristics traditionally used for detection/filtering, such as e-mail content and subjects
• False positives (legitimate e-mails misclassified as spam) impose a high cost on end users
• Goal: improve spam detection by reducing the number of false positives
CDA6938

The Proposed Algorithm
• Assumption: contact lists change less frequently than other characteristics
• The set of recipients targeted by a sender is likely to remain stable for longer periods than e-mail content and subjects
• Exploit structural relationships between senders and recipients
• Senders and recipients are clustered based on the similarity of their contact lists
• Historical information is used

Modeling Similarity Among Senders and Recipients
• Vectorial representation of an e-mail sender: [equation missing in source]
• Similarly for recipients
[Figure: example graph linking senders S0 to S3 with recipients r0 to r2]

Modeling Similarity Among Senders and Recipients (continued)
• Vectorial representation of an e-mail sender: [equation missing in source]
• Vectorial representation of a sender cluster: [equation missing in source]
• Similarity between a sender and a sender cluster: [equation missing in source]
Note: similar representations are used for recipients
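The equations on this slide did not survive, so the following is only a minimal sketch of one plausible reading, not the authors' definitions. It assumes a sender is a binary vector over recipients, a cluster is the element-wise sum of its members' vectors, and similarity is the cosine of the angle between the two vectors. The function names `sender_vector`, `cluster_vector`, and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def sender_vector(recipients):
    """Assumed representation: 1 for every recipient the sender has contacted."""
    return Counter(set(recipients))


def cluster_vector(member_vectors):
    """Assumed cluster representation: element-wise sum of the member vectors."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (assumed similarity measure)."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s0 = sender_vector(["r0", "r1"])
s1 = sender_vector(["r1", "r2"])
cluster = cluster_vector([s0, s1])
similarity = cosine_similarity(s0, cluster)
```

Storing vectors sparsely (only the contacted recipients) keeps the cost proportional to the size of the contact lists rather than to the total number of recipients.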

The Proposed Algorithm
For any incoming e-mail:
1. Classify the incoming e-mail with the auxiliary spam detection method
2. Find the similarity between the sender and each sender cluster
3. Add the sender to the cluster that is most similar to it, as long as the similarity > τ
4. Calculate PS, the spam probability of its sender's cluster (from the auxiliary classification of this e-mail and of previous e-mails sent by the cluster)
5. Repeat the process with the recipients and calculate PR (the average spam probability of all its recipients' clusters)
6. Finally, determine whether the incoming e-mail is spam based on PS and PR
[Figure: an example sender compared against four sender clusters (sim = 0.9, 0.5, 0.2, 0.3); PS = 0.8 in the example]
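Steps 2 to 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation. It uses cosine similarity, opens a new cluster when no existing cluster exceeds τ (the slide does not say what happens in that case), and estimates PS as the fraction of the cluster's e-mails that the auxiliary method flagged as spam. The sketch also treats each e-mail's recipient list as the sender's contact list instead of tracking per-sender history. All names (`Cluster`, `assign_sender`, `sender_cluster_probability`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    vector: Counter = field(default_factory=Counter)  # summed recipient vectors
    spam: int = 0    # e-mails flagged as spam by the auxiliary method
    total: int = 0   # all e-mails sent by this cluster

    def spam_probability(self):
        return self.spam / self.total if self.total else 0.0


def cosine(a, b):
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_sender(clusters, sender_vec, tau):
    """Steps 2-3: pick the most similar cluster if its similarity exceeds tau."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = cosine(sender_vec, cluster.vector)
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is None or best_sim <= tau:
        # Assumption: a sender that matches no cluster starts a new one.
        best = Cluster()
        clusters.append(best)
    best.vector.update(sender_vec)
    return best


def sender_cluster_probability(clusters, recipients, aux_is_spam, tau=0.5):
    """Step 4: update the cluster history and return PS for this e-mail."""
    cluster = assign_sender(clusters, Counter(set(recipients)), tau)
    cluster.total += 1
    cluster.spam += int(aux_is_spam)
    return cluster.spam_probability()


clusters = []
p1 = sender_cluster_probability(clusters, ["a", "b"], aux_is_spam=True)
p2 = sender_cluster_probability(clusters, ["a", "b"], aux_is_spam=False)
p3 = sender_cluster_probability(clusters, ["x"], aux_is_spam=False)
```

Step 5 would apply the same routine to each recipient (with vectors over senders) and average the resulting cluster probabilities to obtain PR.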

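The Proposed Algorithm (continued)
[Figure: for any incoming e-mail, the sender is matched against the sender clusters and the recipients against the recipient clusters; in the example, PS = 0.8 and PR = 0.5, and these are used together with the auxiliary detection method to decide whether the e-mail is spam]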

Classification
• Classify the e-mail as spam if the point (PS, PR) falls in the blue area
• Classify the e-mail as legitimate if the point (PS, PR) falls in the green area
• Implemented by computing the Spam Rank
[Figure: the PS(m) vs. PR(m) plane divided into a spam region and a legitimate region; the example point is PS = 0.8, PR = 0.5]

Spam Rank Computation
• The Spam Rank vector of an e-mail m is the point (PS(m), PR(m))
• The Spam Rank (SR) is the norm of the projection of this vector onto the diagonal
• If SR > ω: classify the e-mail as spam
• Else if SR < 1-ω: classify it as legitimate
• Otherwise, use the classification reported by the auxiliary algorithm
[Figure: the PS(m) vs. PR(m) plane with thresholds ω and 1-ω marked along the diagonal; the example point is PS = 0.8, PR = 0.5]
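The decision rule above can be sketched as follows. The slide does not state how SR is scaled. Since it is compared against ω and 1-ω, this sketch assumes the projection norm is divided by the length of the unit square's diagonal (√2), so that SR lies in [0, 1]. The function names `spam_rank` and `classify` are hypothetical, and ω = 0.85 is the value used later in the experiments.

```python
import math


def spam_rank(ps, pr):
    """Norm of the projection of (ps, pr) onto the diagonal, scaled to [0, 1]."""
    # Projection onto the unit vector (1, 1) / sqrt(2).
    projection_norm = (ps + pr) / math.sqrt(2)
    # Assumption: normalize by the diagonal length sqrt(2) of the unit square.
    return projection_norm / math.sqrt(2)


def classify(ps, pr, aux_is_spam, omega=0.85):
    """Apply the SR thresholds; fall back to the auxiliary verdict in between."""
    sr = spam_rank(ps, pr)
    if sr > omega:
        return "spam"
    if sr < 1 - omega:
        return "legitimate"
    return "spam" if aux_is_spam else "legitimate"


example_sr = spam_rank(0.8, 0.5)
example_label = classify(0.8, 0.5, aux_is_spam=True)
```

Under this assumed scaling, the slide's example point (PS = 0.8, PR = 0.5) gives SR = 0.65, which lies between 1-ω and ω, so the auxiliary classification decides.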

Experiments
• Experimental data set
• Auxiliary spam detection method: SpamAssassin

Selecting the Similarity Threshold τ
• The number of sender/recipient clusters is roughly stable for τ ≥ 0.5, so τ = 0.5 is used in the experiments

Effectiveness of Spam Rank
[Figure panels: (a) bin size = 0.25; (b) bin size = 0.10]
• Clusters with high PS/PR send/receive a large number of spam e-mails
• Clusters with low PS/PR send/receive a large number of legitimate e-mails

E-mail Classification
• A higher ω means that fewer e-mails can be classified by the Spam Rank alone
• There is a tradeoff between the total number of e-mails that are classified and the agreement with the classification given by the original (auxiliary) classifier

Accuracy of Spam Detection
• Parameters: τ = 0.5, ω = 0.85
• 879 e-mails were manually analyzed for possible false positives

Strengths of the Paper
• New spam detection algorithm that exploits structural similarities of senders and recipients
• Clusters senders/recipients based on their contact lists
• Using the historical information of each cluster can improve the accuracy of existing detection algorithms
• Reduces the number of false positives produced by SpamAssassin

Weaknesses of the Paper
• Runs slowly because the high-dimensional vectors lead to computational overhead
• The assumption may not always hold
• Cannot handle forged e-mail addresses
• The false-negative rate is possibly high

Improvements to the Paper
• Dimension reduction for clustering
• Majority voting that combines several spam detection techniques to reduce both false positives and false negatives
• Consider the spam probability of a sender-recipient pair
• Further evaluation with logs covering a longer period (more than the eight days of logs used)

Q&A
