1 / 19

Improving Spam Detection Based on Structural Similarity

This paper presents an innovative spam detection algorithm leveraging structural similarities among senders and recipients to reduce false positives. By clustering based on contact lists and historical data, it enhances existing techniques and minimizes misclassifications.

jillc
Download Presentation

Improving Spam Detection Based on Structural Similarity

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Improving Spam DetectionBased on Structural Similarity Authors: Luiz Henrique Gomes, Fernando Castro, Rodrigo Almeida, Luis Bettencourt, Virgílio Almeida Publish: Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005 Presenter: Danzhou Liu

  2. Introduction • Volume of spam traffic is increasing sharply • 83% of all incoming e-mails in 2005 vs. 24% in January 2003 • Current detection techniques are not fully successful • Spammers evade detection by frequently changing e-mail characteristics traditionally used for detection/filtering • E-mail content and subjects • False positives: legitimate e-mails are misclassified as spam. • high cost to end-users • Goals: • Improve spam detection by reducing the number of false positives CDA6938

  3. The Proposed Algorithm • Assumption: contact lists change less frequently than other characteristics • Set of recipients targeted by a sender is likely to remain stable for longer periods than E-mail content and subjects • Exploit structural relationships between senders and recipients • Senders and recipients are clustered based on similarity of their contact lists • Historical information is used CDA6938

  4. Modeling SimilarityAmong Senders and Recipients • Vectorial representation of an e-mail sender: S0 r0 S1 r1 Similarly, S2 r2 S3 CDA6938

  5. Modeling SimilarityAmong Senders and Recipients • Vectorial representation of an e-mail sender: • Vectorial representation of a sender cluster: • Similarity between a sender and a sender cluster: Note: Similar representations for recipients CDA6938

  6. The Proposed Algorithm CDA6938

  7. The Proposed Algorithm Sender Clusters Here sim=0.9 sim=0.5 Ps( , , )=0.8 Sender For any incoming e-mail From: To: sim=0.2 sim=0.3 1. Classify incoming e-mail by auxiliary spam detection method 2. Find similarity between sender and each sender cluster Auxiliary Detection Method 3. Add sender to cluster that is most similar to it as long as similarity > τ 4. Calculate PS, the spam probability of its sender’s cluster (auxiliary classification of this e-mail and previous e-mails sent by cluster) 5. Repeat process with the recipients, calculate PR (average spam probability of all its recipients’ clusters) Spam? 6. Finally determine whether the incoming e-mail is spam based on PS and PR CDA6938

  8. Ps( , , )=0.8 PR(( ), ( , , , ))=0.5 The Proposed Algorithm Here Sender Clusters Sender For any incoming e-mail From: To: Recipients Clusters Here Auxiliary Detection Method Recipients Here Spam? CDA6938

  9. Ps( , , )=0.8 PR(( ), ( , , , )=0.5 Classification • Classify the e-mail as spam if the point (PS, PR) falls in the blue area • Classify the e-mail as legitimate if the point (PS, PR) falls in the green area • Implemented by computing Spam Rank PR(m) Spam 0.5 Legitimate 0.8 PS(m) CDA6938

  10. Ps( , , )=0.8 PR(( ), ( , , , )=0.5 Spam Rank Computation The Spam Rank vector is: The Spam Rank (SR) is the norm of the projection of over diagonal If SR>ω: classify e-mail as spam Else If SR<1-ω: classify it as legitimate Otherwise, use classification reported by auxiliary algorithm PR(m) Spam 1-ω ω 0.5 SR Legitimate 0.8 PS(m) CDA6938

  11. Experiments • Experimental Data Set • Auxiliary Spam Detection Method • Spam Assassin CDA6938

  12. Selecting the Similarity Threshold τ • Number of sender/recipient clusters is roughly stable for τ ≥ 0.5, therefore use τ = 0.5 in experiments CDA6938

  13. Effectiveness of Spam Rank (a) Bin Size = 0.25 (a) Bin Size = 0.10 • Clusters with high PS / PR send/receive large number of spam • Clusters with low PS/PR send/receive large number of legitimate e-mails CDA6938

  14. E-mail Classification • Higher ω indicates smaller number of e-mails can be classified • Tradeoff between the total number of e-mails that are classified and the accordance with the previous classification provided by the original classifier algorithm CDA6938

  15. Accuracy of Spam Detection • τ = 0.5 , ω = 0.85: • 879 e-mails were manually analyzed for possible false positives CDA6938

  16. Strengths of the Paper • New spam detection algorithm that exploits structural similarities of senders and recipients • Clustering senders/recipients based on contact lists • Using historical information of each cluster can improve accuracy of existing detection algorithms • Reduce number of false positives caused by Spam Assassin CDA6938

  17. Weaknesses of the Paper • Run slow because of high dimensional vectors which leading to computational overhead • Assumption may not always hold • Cannot handle forged e-mail addresses • False negative is possibly high CDA6938

  18. Improvements of the Paper • Dimension reduction for clustering • Majority voting by combining several spam detection techniques to reduce both false positives and false negatives • Consider spam probability of a sender-recipient pair • Further evaluation with logs covering longer period (more than eight-day logs) CDA6938

  19. Q&A CDA6938

More Related