Spam Detection
Kingsley Okeke, Nimrat Virk
Spam emails, also known as junk emails, are unwanted emails sent to numerous recipients.
• They impede our ability to recognise normal emails.
• They can also be a threat to computer security.
Everyone hates spam!
Text Mining
What is text mining?
• Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. (Wikipedia)
Applications
• Marketing applications
  • Used to improve predictive analytic models for customers
  • E.g. open-ended questions in surveys
• Online media applications
  • Used by large media companies to give users a better search experience
• Academic applications
  • Publishers with large databases use text mining for easy information retrieval
Using text mining, we can analyse patterns common in spam emails in order to distinguish them from ham (normal) emails.
Steps
1) Get some training data
• A large collection of spam and normal emails
• SpamAssassin public corpus (http://www.spamassassin.org/publiccorpus/)
2) Data Pre-processing
a) Stop word removal: e.g. for, when, to, a, be
• Domain-specific stop words, e.g. email, send
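Stop word removal can be sketched in a few lines of Python. This is only an illustration: the two word lists contain just the examples from the slide, and the `tokenize` and `remove_stop_words` helpers are hypothetical names, not part of any library.

```python
import re

# Illustrative lists only: these are the examples given on the slide,
# not a complete stop-word list.
GENERAL_STOP_WORDS = {"for", "when", "to", "a", "be"}
DOMAIN_STOP_WORDS = {"email", "send"}
STOP_WORDS = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = tokenize("Send this email to a friend when you win!")
filtered = remove_stop_words(tokens)
print(filtered)
```

A real system would use a much longer general list and would derive the domain-specific list from the corpus itself.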
b) Stemming: reducing words to their stem/root by removing affixes
• E.g. discussed, discussing → discuss
• Porter stemming algorithm
  • One of the most widely used stemming algorithms
  • Developed by Martin Porter
  • http://www.tartarus.org/~martin/PorterStemmer/
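To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the `naive_stem` function and its `min_stem_len` parameter are hypothetical and only cover the slide's example.

```python
def naive_stem(word, min_stem_len=3):
    """Strip a few common English suffixes (a toy stand-in for Porter)."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        # Only strip if enough of the word is left to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # Strip a plural "s", but keep words ending in "ss" (e.g. "discuss").
    if (word.endswith("s") and not word.endswith("ss")
            and len(word) - 1 >= min_stem_len):
        return word[:-1]
    return word


stems = [naive_stem(w) for w in ("discussed", "discussing", "discuss")]
print(stems)
```

All three forms map to the same stem, so they count as a single feature in the later steps. In practice an existing Porter implementation would be used rather than hand-written rules like these.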
c) Feature Selection
What are good and bad features?
Good features:
• Co-occur with a particular category
• Do not co-occur with other categories
Bad features:
• Uniform across all categories
• Very infrequent occurrence
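Information Gain
• A common feature selection technique used in machine learning applications.
• The information gain of a term t measures how much knowing whether t is present in a document reduces the uncertainty about the document's category.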
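One common way to write this is IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], where H is the entropy of the category distribution. The sketch below assumes that form, with binary term presence and base-2 logarithms; the function names and the toy documents are made up for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    docs is a list of token sets; labels is the parallel list of categories.
    """
    categories = sorted(set(labels))
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def counts(subset):
        return [subset.count(c) for c in categories]

    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(counts(with_term))
                   + (len(without_term) / n) * entropy(counts(without_term)))
    return entropy(counts(labels)) - conditional


# Toy corpus (invented): "free" appears only in spam, "now" in both classes.
docs = [{"free", "money", "now"}, {"free", "offer"},
        {"meeting", "monday", "now"}, {"project", "meeting"}]
labels = ["spam", "spam", "ham", "ham"]
ig_free = information_gain(docs, labels, "free")
ig_now = information_gain(docs, labels, "now")
print(ig_free, ig_now)
```

Here "free" is a good feature in the sense of the previous slide (it separates the categories perfectly), while "now" is uniform across categories and carries no information.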
Feature Representation
TF: Term Frequency
• Definition: TF = t(i, j), the frequency of term i in document j
• Purpose: gives more weight to words that occur frequently in the document
TF-IDF: Term Frequency - Inverse Document Frequency
• The value of a term i in document j
• Definition: TF × IDF = t(i, j) × log(N / n_i)
• n_i: number of documents containing term i
• N: total number of documents
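The TF-IDF definition above translates almost directly into code. The following is a minimal sketch, assuming raw counts for t(i, j) and the natural logarithm (the slide does not specify a base); `tf_idf` and the sample documents are illustrative names and data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents is a list of token lists; the weight of term i in
    document j is t(i, j) * log(N / n_i).
    """
    n_docs = len(documents)
    # n_i: number of documents that contain term i at least once.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # t(i, j): raw count of term i in document j
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


documents = [["free", "money", "free", "now"],
             ["meeting", "now"],
             ["offer", "now"]]
weights = tf_idf(documents)
print(weights[0])
```

A term that occurs in every document ("now" here) gets weight 0 because log(N / n_i) = log(1) = 0, while a term that is frequent in one document and rare elsewhere ("free") gets the highest weight.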
d) Text Classification
• WEKA is used for classification
• The pre-processed training data is used to build a classification model
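The slides use WEKA and do not say which classifier was chosen. Purely to illustrate what "building a model from training data" means, here is a small multinomial Naive Bayes sketch in Python with add-one smoothing. The choice of Naive Bayes, the `NaiveBayes` class, and the toy training set are all assumptions and do not reflect WEKA's API.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per category
        self.term_counts = defaultdict(Counter)   # term counts per category
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(count / n_docs)      # log prior
            for tok in doc:
                if tok in self.vocab:             # ignore unseen words
                    score += math.log(
                        (self.term_counts[label][tok] + 1)
                        / (total_terms + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy pre-processed training data (invented for illustration).
train_docs = [["free", "money"], ["free", "offer"],
              ["meeting", "monday"], ["project", "meeting"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "offer", "today"]), model.predict(["meeting"]))
```

In the workflow described on the slides, the equivalent step would be loading the pre-processed feature vectors into WEKA and training one of its classifiers there.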