Spam Detection Kingsley Okeke Nimrat Virk

Everyone hates spam!
Spam emails, also known as junk emails, are unwanted messages sent to large numbers of recipients.
• They impede our ability to recognise normal emails.
• They can also be a threat to computer security.

But how do we filter spam out from normal emails?

Text Mining
What is text mining?
• Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. (Wikipedia)

Applications
• Marketing applications
  • Used to improve predictive analytics models for customers
  • E.g. open-ended questions in surveys
• Online media applications
  • Used by large media companies to give users a better search experience
• Academic applications
  • Publishers with large databases use text mining for easier information retrieval

Using text mining, we can analyse the patterns common in spam emails in order to distinguish them from ham (normal) emails.

Steps
1) Get some training data
• A large collection of spam and normal emails
• SpamAssassin public corpus (http://www.spamassassin.org/publiccorpus/)

2) Data pre-processing
a) Stop words: remove very common words that carry little information, e.g. for, when, to, a, be
• Domain-specific stop words, e.g. email, send
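The stop-word step can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the word lists contain only the examples from the slide, and the function name `remove_stop_words` and the letters-only tokenisation are assumptions made for the example.

```python
import re

# Illustrative stop-word lists built from the slide's examples only;
# a real filter would use a much longer list.
GENERAL_STOP_WORDS = {"for", "when", "to", "a", "be"}
DOMAIN_STOP_WORDS = {"email", "send"}
STOP_WORDS = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS


def remove_stop_words(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

For example, `remove_stop_words("Send a reply to claim your prize")` returns `['reply', 'claim', 'your', 'prize']`.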

b) Stemming: reducing words to their stem (root form) by stripping suffixes
• E.g. discussed, discussing → discuss
• Porter stemming algorithm
  • One of the most widely used stemming algorithms
  • Developed by Martin Porter: http://www.tartarus.org/~martin/PorterStemmer/
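To make the idea of suffix stripping concrete, here is a deliberately simplified sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the function name `naive_stem` and the suffix list are assumptions chosen only to reproduce the slide's example.

```python
def naive_stem(word):
    """Strip one common suffix; a toy stand-in, not the Porter algorithm."""
    # Words ending in "ss" (e.g. "discuss") are already in stem form.
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, "discussed", "discussing" and "discuss" all map to "discuss", so the three forms count as a single feature.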

c) Feature selection
What are good and bad features?
Good features:
• Co-occur with one particular category
• Do not co-occur with other categories
Bad features:
• Uniform across all categories
• Occur very infrequently

Information Gain
• A common feature selection technique used in machine learning applications.
• The information gain of term t is defined as:
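The defining formula is not preserved in this transcript. The sketch below assumes the standard entropy-based definition used in text categorisation, IG(t) = H(C) − [P(t)·H(C | t present) + P(not t)·H(C | t absent)], for the two categories spam and ham. The function names and the count-based interface are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(spam_with, ham_with, spam_without, ham_without):
    """IG of a term from document counts split by class and term presence."""
    n_with = spam_with + ham_with
    n_without = spam_without + ham_without
    n = n_with + n_without
    if n == 0:
        return 0.0
    # Entropy of the class distribution before looking at the term.
    prior = entropy([spam_with + spam_without, ham_with + ham_without])
    # Expected entropy after observing whether the term is present.
    conditional = (
        (n_with / n) * entropy([spam_with, ham_with])
        + (n_without / n) * entropy([spam_without, ham_without])
    )
    return prior - conditional
```

A term that appears in every spam email and no ham email, e.g. `information_gain(50, 0, 0, 50)`, scores 1 bit, while a term spread uniformly across both categories, e.g. `information_gain(25, 25, 25, 25)`, scores 0. This matches the good/bad feature criteria listed earlier.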

Feature Representation

TF: Term Frequency
• Definition: TF = t(i,j)
  • frequency of term i in document j
• Purpose: makes words that are frequent in a document more important
TF-IDF (Term Frequency - Inverse Document Frequency)
• Value of a term i in document j
• Definition: TF×IDF = t(i,j) × log(N/ni)
  • ni: number of documents containing term i
  • N: total number of documents
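The TF×IDF definition above can be computed directly from tokenised documents. This is a minimal sketch: the slide does not state the base of the logarithm, so the natural logarithm is assumed here, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    `documents` is a list of token lists. The weight of term i in document j
    is t(i, j) * log(N / n_i), using the natural logarithm (an assumption).
    """
    n_docs = len(documents)
    # n_i: number of documents containing term i (each term counted once per doc).
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        term_counts = Counter(doc)  # t(i, j): raw frequency in this document
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in term_counts.items()}
        )
    return weighted
```

Note that a term occurring in every document gets weight 0, because log(N/ni) = log(1) = 0; such a term cannot help separate spam from ham.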

d) Text classification
• WEKA
• Training data is used to build a classification model
• This model is built from the pre-processed data
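The slides use WEKA for this step and do not name a specific classifier. Purely to illustrate how a model is built from pre-processed training data, here is a small multinomial Naive Bayes sketch in plain Python; it is a swapped-in technique, not WEKA's API, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(labelled_docs):
    """Build a model from (tokens, label) pairs: class counts and word counts."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training on a handful of pre-processed token lists labelled "spam" or "ham" and then calling `classify(model, tokens)` on a new email mirrors the train-then-predict workflow described above.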

END
