Spam Email Detection
Ethan Grefe
December 13, 2013

Motivation
• Spam email constantly clutters inboxes
• It is commonly removed using rule-based filters
• Spam messages often share very similar characteristics
• This allows them to be detected using machine learning
  • Naïve Bayes classifiers
  • Support Vector Machines

SVM Solution
• Used training data from the CSDMC2010 SPAM corpus
  • 4327 labeled emails
  • 2949 non-spam messages (HAM)
  • 1378 spam messages (SPAM)
• Extracted features from the subject and body of each email
• Used the resulting feature vectors to train an SVM classifier in MATLAB
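
The step of pulling the subject and body out of a raw email can be sketched with Python's standard `email` package. This is only an illustration: the original work was done in MATLAB, and the helper name `subject_and_body` and the sample message are hypothetical. Real spam is often malformed, so a production version would need error handling around decoding.

```python
from email import message_from_string
from email.policy import default


def subject_and_body(raw_email: str) -> tuple[str, str]:
    """Return (subject, body text) for one raw RFC 5322 message.

    Hypothetical helper, not the author's MATLAB code.
    """
    msg = message_from_string(raw_email, policy=default)
    subject = str(msg["Subject"] or "")
    # Prefer a plain-text part, fall back to HTML if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""
    return subject, body


# Made-up sample message, for illustration only.
raw = (
    "Subject: WIN CASH NOW\n"
    "From: sender@example.com\n"
    "\n"
    "Click <a href='x'>here</a>!\n"
)
subject, body = subject_and_body(raw)
```

The subject and body strings returned here are the inputs from which features would then be computed.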

Email Features
• Features were determined by research and observation
• Best results were obtained with the following features
  • Percentage of letters that are capitalized
  • Types of punctuation used
  • Average length of a word
  • Amount of HTML in the email
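
The four features listed above could be computed roughly as follows. This is a minimal Python sketch under stated assumptions, not the author's MATLAB implementation: the function name `extract_features`, the particular punctuation marks counted, and the use of tag length as a proxy for "amount of HTML" are all illustrative choices that the slides do not specify.

```python
import re

# Assumption: the slides do not say which punctuation marks were tracked.
PUNCTUATION = "!?$."


def extract_features(text: str) -> dict:
    """Compute simple spam-related features from an email's text."""
    letters = [c for c in text if c.isalpha()]
    capital_pct = (
        100.0 * sum(c.isupper() for c in letters) / len(letters)
        if letters else 0.0
    )

    # Count of each tracked punctuation mark.
    punct_counts = {p: text.count(p) for p in PUNCTUATION}

    # Words are runs of letters and apostrophes (tag names are included).
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0

    # Fraction of characters that sit inside something shaped like <...>.
    tags = re.findall(r"<[^>]+>", text)
    html_fraction = sum(len(t) for t in tags) / len(text) if text else 0.0

    return {
        "capital_pct": capital_pct,
        "punct_counts": punct_counts,
        "avg_word_len": avg_word_len,
        "html_fraction": html_fraction,
    }


features = extract_features("FREE money!!! <b>Click</b>")
```

Flattening the returned dictionary into a fixed-order list of numbers gives one feature vector per email, which is the form an SVM trainer expects.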

Classifier Results
• Trained on a random 35% of emails
• Tested SVM classifier on remaining 65%
• Trained SVM using three different kernel functions
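
The random 35% / 65% split can be sketched as below. The function name `split_train_test` and the fixed seed are assumptions made for reproducibility. The slides do not say how the random subset was drawn or whether it was stratified by class.

```python
import random


def split_train_test(items, train_fraction=0.35, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Indices standing in for the 4327 labeled emails of the corpus.
train, test = split_train_test(range(4327))
```

With 4327 emails this yields 1514 training and 2813 test examples. Repeating the split with several seeds would show how sensitive the reported accuracy is to the particular sample.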

Possible Improvements
• Use Naïve Bayes to classify emails using word frequency
• Obtain a wider variety of input features
• Test other types of learning algorithms
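
The first improvement, Naïve Bayes over word frequencies, might look like the following. This is a minimal multinomial Naïve Bayes sketch with Laplace smoothing, written from the general technique and not taken from the presentation. The class name, the tokenizer, and the four toy training messages are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training emails
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text: str):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for illustration; not drawn from CSDMC2010.
texts = [
    "win free cash now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday?",
]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
```

Words never seen in training are skipped at prediction time, and the log-space sum avoids underflow on long emails.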