Density-Based Spam Detector

RESEARCH TRACK
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining
INDUSTRIAL TRACK
Density-Based Spam Detector (Acceptance rate = 40/337 = 12%)

Hiromitsu FUJIKAWA, Katsuyuki YAMAZAKI
KDDI R&D Laboratories Inc.
2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan
{fujikawa,yamazaki}@kddilabs.jp

Fuminori ADACHI, Takashi WASHIO, Hiroshi MOTODA
ISIR, Osaka University
8-1, Mihogaoka, Ibarakishi, Osaka 567-0047, Japan
{adachi,washio,motoda}@ar.sanken.osaka-u.ac.jp

Density-Based Spam Detector
• A new spam detection method that uses document space density information
  • The use of document space density
  • An efficient implementation through the use of a direct-mapped cache
• Purpose
  • A spam filter that runs on a mail server has to offer:
    • High processing speed
    • Easy maintenance
    • High accuracy
    • Privacy protection

System Architecture
• Unsupervised learning: solves the privacy problem and the maintenance problem
• Hash table: solves the processing speed problem
• Similarity, threshold: solve the spam filter accuracy problem
[Architecture diagram. Labels: an incoming email; find feature by N-gram; hash feature; hash table; Hash DB (read features; write/update features, similar email, email); mail corpus; calculate similarity; similarity > SPAM threshold: SPAM]

Related work
• Bayesian-like approach
• Rule-based approach
• Checksum database
  • http://www.dcc-servers.net/dcc/
• Vector representation
• Hash-based text representation
  • Text retrieval, text compression, spam filtering
  • A direct-mapped cache is used in place of LRU

Density-based spam detector
• Document space density
  • Count the number of similar emails
• Once the number of similar emails has been counted, a simple threshold is enough to distinguish spam from other emails.
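The counting-plus-threshold idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `density` and `looks_like_spam`, the caller-supplied `is_similar` predicate, and the strict `>` comparison are all assumptions made for the example.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def density(email: T, previous: Iterable[T],
            is_similar: Callable[[T, T], bool]) -> int:
    """Count how many previously seen emails are similar to `email`."""
    return sum(1 for old in previous if is_similar(email, old))


def looks_like_spam(email: T, previous: Iterable[T],
                    is_similar: Callable[[T, T], bool],
                    threshold: int) -> bool:
    """Flag the email when its similar-email count exceeds the threshold.

    The threshold value and the strict comparison are illustrative choices.
    """
    return density(email, previous, is_similar) > threshold
```

With exact equality standing in for the similarity test, an email seen three times before exceeds a threshold of 2, while an email seen once does not.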

Mail System Design
• Mass Mail Detector
  • Monitors network packets
  • Analyzes SMTP traffic between mail servers and reconstructs the text of emails
  • Converts the text into a vector representation

Hash design
• Hash-based representation
  • The hash value of each length-L substring is calculated, and the first N of these values are used as the vector representation of the email
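A minimal sketch of this hashing step is shown below. The slide does not name the hash function or say how "the first N" values are ordered, so the use of CRC32, the choice of document order, the function name `hash_features`, and the default values of `substring_len` (L) and `num_features` (N) are all assumptions made for illustration.

```python
import zlib


def hash_features(text: str, substring_len: int = 9,
                  num_features: int = 100) -> list[int]:
    """Hash each length-L substring and keep the first N hash values.

    Assumptions: CRC32 as the hash, substrings taken over UTF-8 bytes,
    and "first N" meaning the first N in document order.
    """
    data = text.encode("utf-8")
    features: list[int] = []
    for start in range(len(data) - substring_len + 1):
        if len(features) == num_features:
            break
        features.append(zlib.crc32(data[start:start + substring_len]))
    return features
```

Two emails with identical text yield identical feature vectors, and a text shorter than L yields an empty vector.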

Caching Architecture
• Direct-mapped cache architecture
  • The hash database stores the hash values of each email and the number of similar emails.
  • The direct-mapped cache is a simple hash table; through it, the algorithm can find the entries of emails that have the same hash value.
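The behavior of a direct-mapped cache can be sketched as below: each hash value maps to exactly one slot, and a new entry simply overwrites whatever occupied that slot, so no LRU bookkeeping is needed. The class name `DirectMappedCache`, the modulo indexing, and the stored `email_id` are hypothetical choices for this example, not details taken from the slides.

```python
from typing import Optional


class DirectMappedCache:
    """Fixed-size table in which each hash value has exactly one possible slot."""

    def __init__(self, num_slots: int) -> None:
        self.num_slots = num_slots
        # Each slot holds (hash_value, email_id), or None when empty.
        self.slots: list[Optional[tuple[int, str]]] = [None] * num_slots

    def _index(self, hash_value: int) -> int:
        # Assumed mapping: slot index is the hash value modulo the table size.
        return hash_value % self.num_slots

    def lookup(self, hash_value: int) -> Optional[str]:
        """Return the email id stored for this hash value, if still cached."""
        entry = self.slots[self._index(hash_value)]
        if entry is not None and entry[0] == hash_value:
            return entry[1]
        return None

    def store(self, hash_value: int, email_id: str) -> None:
        """Write the entry, evicting any previous occupant of the slot."""
        self.slots[self._index(hash_value)] = (hash_value, email_id)
```

Both lookup and store touch a single slot, which keeps the cost per hash value constant; the trade-off is that two hash values mapping to the same slot evict each other.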

Similar emails
• To check a single email, the detector looks for previous emails that share S% of the same hash values
• Algorithm
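The S% test can be illustrated with a short sketch. The algorithm shown on the original slide is not preserved in this transcript, so this is only one plausible reading: the function names, the choice to measure the share relative to the new email's hash values, and the `>=` comparison are assumptions.

```python
from typing import Sequence


def shared_fraction(new_hashes: Sequence[int],
                    old_hashes: Sequence[int]) -> float:
    """Fraction of the new email's hash values that also occur in the old email."""
    if not new_hashes:
        return 0.0
    old = set(old_hashes)
    return sum(1 for h in new_hashes if h in old) / len(new_hashes)


def is_similar(new_hashes: Sequence[int], old_hashes: Sequence[int],
               share_percent: float) -> bool:
    """True when at least `share_percent` (S) percent of hash values are shared."""
    return 100.0 * shared_fraction(new_hashes, old_hashes) >= share_percent
```

For example, two emails that share three of four hash values pass a 75% test, while emails that share only one of four do not.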

Experimental results
• The distribution of similar emails follows Zipf’s law
• Summary of experimental results

Experimental results
• Recall rate
• Cache usage
• Log

Experimental results
• Comparison with other methods
• Testing method

Experimental results
• Effect of topic change
• bsfilter, tested on 2 groups of mailing-list mail
  • S: 528 mails
  • H: 1538 mails
  • Training: H + 1/2 S
  • Testing: 1/2 S
• Result: after some period, bsfilter misclassified most of the mails
• Reason: the topics changed

Experimental results
• Effect of on-line learning

Maintenance and privacy
• Supervised learning methods require positive and negative examples of spam
  • Someone has to check the contents of the mail manually, which can violate privacy.
• Although our method requires a white list, maintaining a white list is relatively easy, especially compared with maintaining a black list.

Conclusion
• High processing speed
  • 13000 emails per second (1.25 billion emails per day)
• Maintenance free
• 98% recall rate and 100% precision
• Privacy protection
