Email Spam Detection Using Machine Learning Lydia Song, Lauren Steimle, Xiaoxiao Xu
Outline • Introduction to Project • Pre-processing • Dimensionality Reduction • Brief discussion of different algorithms • K-nearest neighbors • Decision tree • Logistic regression • Naïve Bayes • Preliminary results • Conclusion
Spam Statistics • The percentage of spam in email traffic averaged 69.9% in February 2014 [Chart: percentage of spam in email traffic] Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014
Spam vs. Ham • Spam = unwanted communication • Ham = normal communication
Pre-processing [Figure: example of a spam email as shown in a web browser and the corresponding raw file in the data set]
Pre-processing • Remove meaningless words • Create a “bag of words” used in the data set • Combine similar words • Create a feature matrix [Figure: feature matrix with one column per word (“service”, “last”, “history”, ...) and one row per email (Email 1 ... Email m)]
Pre-processing Example
Original email: “Your history shows that your last order is ready for refilling. Thank you, Sam Mcfarland, Customer Services”
tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']
filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']
bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
[Figure: feature matrix with stemmed-word columns (“servi”, “histori”, “last”, ...) and rows Email 1 ... Email m]
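Below is a rough Python sketch of these steps, assuming NLTK's stopword list and Porter stemmer stand in for the project's actual filtering and stemming choices; the email text is the example above.

```python
# Sketch of the pre-processing pipeline (assumes nltk is installed and its
# 'stopwords' corpus has been downloaded; Porter stemming is an assumption
# about how "similar words" are combined).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize: lowercase, keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove meaningless (stop) words such as "your", "that", "is".
    filtered = [t for t in tokens if t not in stop_words]
    # Combine similar words by stemming ("services" -> "servic").
    return [stemmer.stem(t) for t in filtered]

email = "Your history shows that your last order is ready for refilling."
print(preprocess(email))
```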
Dimensionality Growth • Each additional email adds roughly 100-150 new features
Dimensionality Reduction • Add a requirement that words must appear in x% of all emails to be considered a feature
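As an illustration, here is a minimal Python sketch of this document-frequency rule; the helper name and toy emails are made up, and the 0.15 default mirrors the 15% cutoff used in the preliminary results.

```python
# Keep only words that appear in at least `min_fraction` of all emails
# (sketch; assumes each email is already a list of stemmed tokens).
def select_features(emails, min_fraction=0.15):
    n = len(emails)
    doc_freq = {}
    for tokens in emails:
        for word in set(tokens):          # count each word once per email
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(w for w, c in doc_freq.items() if c / n >= min_fraction)

emails = [["histori", "last", "order"], ["order", "free", "win"], ["free", "order"]]
print(select_features(emails, min_fraction=0.5))   # -> ['free', 'order']
```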
Dimensionality Reduction - Hashing Trick • Before hashing: 70 x 9403 dimensions • After hashing: 70 x 1024 dimensions [Figure: hash table mapping string keys to integer bucket indices] Image source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks
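A small sketch of the hashing trick in Python, assuming an MD5-based bucket index and 1024 buckets; the project's actual hash function and bucket count may differ.

```python
import hashlib

def hashed_features(tokens, n_buckets=1024):
    # Map each token to a bucket index with a stable hash, then count
    # occurrences; the feature vector length is fixed at n_buckets
    # regardless of vocabulary size (sketch, not the project's exact code).
    vec = [0] * n_buckets
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

print(sum(hashed_features(["histori", "last", "order", "order"])))  # 4 counts total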
Outline • Introduction to Project • Pre-processing • Dimensionality Reduction • Brief discussion of different algorithms • K-nearest neighbors • Decision tree • Logistic regression • Naïve Bayes • Preliminary results • Conclusion
K-Nearest Neighbors • Goal: Classify an unknown sample into one of C classes • Idea: To determine the label of an unknown sample x, look at the class labels of x's k nearest neighbors in the training set Image from MIT OpenCourseWare
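A minimal k-nearest-neighbors sketch in Python; Euclidean distance, majority voting, and the toy word-count vectors are assumptions for illustration, not necessarily the project's choices.

```python
import math
from collections import Counter

def knn_predict(x, train_X, train_y, k=3):
    # Distance from x to every training sample (Euclidean, as an assumption).
    dists = [(math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)]
    # Majority vote among the k nearest neighbors.
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

train_X = [[3, 0], [2, 1], [0, 3], [0, 2]]   # toy word-count features
train_y = ["spam", "spam", "ham", "ham"]
print(knn_predict([1, 0], train_X, train_y))  # -> 'spam'
```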
Decision Tree • Convert the training data into a tree structure • Root node: the first decision node • Decision node: an if-then decision based on the features of the sample • Leaf node: contains a class label Image from MIT OpenCourseWare
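For illustration, a sketch using scikit-learn's DecisionTreeClassifier on toy word-count features; the library and the toy data are assumptions, not necessarily the project's setup.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy word-count features: [count("free"), count("order")]
X = [[3, 0], [2, 1], [0, 2], [0, 3]]
y = ["spam", "spam", "ham", "ham"]

# Each internal node is an if-then test on one feature; leaves hold class labels.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[4, 0]]))  # -> ['spam']
```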
Logistic Regression • “Regression” over the training examples • Transform the continuous output y into a prediction of 1 (spam) or 0 (ham) using the standard logistic function h(x) = 1 / (1 + e^(-θᵀx)) • Predict spam if h(x) ≥ 0.5
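A sketch of this prediction rule in Python; the weights below are made up for illustration, and the training step (e.g. gradient descent) is omitted.

```python
import math

def predict_spam(x, theta):
    # Standard logistic (sigmoid) function applied to the linear score.
    score = sum(t * xi for t, xi in zip(theta, x))
    h = 1.0 / (1.0 + math.exp(-score))
    return 1 if h >= 0.5 else 0          # 1 = spam, 0 = ham

theta = [1.5, -0.8]                      # illustrative weights, not learned values
print(predict_spam([2, 1], theta))       # score = 2.2, h ≈ 0.90 -> 1 (spam)
```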
Naïve Bayes • Use Bayes' theorem: P(H | e) = P(e | H) P(H) / P(e) • Hypothesis (H): spam or not spam • Event (e): a word occurs • For example, the probability that an email is spam given that the word “free” appears in it • “Naïve”: assume the feature values are independent of each other
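A sketch of the single-word Bayes computation with made-up probabilities; the values below are illustrative only, not the project's data.

```python
# Illustrative counts (assumption): how often "free" appears in spam vs. ham.
p_spam = 0.7                  # P(spam): prior fraction of spam emails
p_free_given_spam = 0.30      # P("free" in email | spam)
p_free_given_ham = 0.02       # P("free" in email | ham)

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # -> 0.972

# "Naive" part: for several words, multiply the per-word likelihoods
# as if they were independent given the class.
```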
Outline • Introduction to Project • Pre-processing • Dimensionality Reduction • Brief discussion of different algorithms • K-nearest neighbors • Decision tree • Logistic regression • Naïve Bayes • Preliminary results • Conclusion
Preliminary Results • 250 emails in the training set, 50 in the testing set • Use 15% as the “percentage of emails” cutoff • Performance measures: • Accuracy: % of predictions that were correct • Recall: % of spam emails that were predicted correctly • Precision: % of emails classified as spam that were actually spam • F-score: harmonic mean of precision and recall
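For reference, the four measures as a plain Python sketch over toy predicted and true labels; the labels below are illustrative only, not the project's results.

```python
def scores(y_true, y_pred, positive="spam"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recall = tp / (tp + fn)            # spam caught / all actual spam
    precision = tp / (tp + fp)         # true spam / everything flagged as spam
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_score

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(scores(y_true, y_pred))   # (0.6, 0.667, 0.667, 0.667) approximately
```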
“Percentage of Emails” Performance [Figure: performance as the “percentage of emails” cutoff varies, for linear regression and logistic regression]
Next Steps • Implement SVM: Matlab vs. Weka • Hashing trick: try different numbers of buckets • Regularization