LINGER – A Smart Personal Assistant for E-mail Classification
James Clark, Irena Koprinska, and Josiah Poon
School of Information Technologies, University of Sydney, Sydney, Australia
e-mail: {jclark, irena, josiah}@it.usyd.edu.au
Goal
• Learn to automatically:
  - File e-mails into folders
  - Filter spam e-mail

Motivation
• Information overload: we are spending more and more time filtering e-mails and organizing them into folders to facilitate retrieval when necessary.
• Weaknesses of the programmable automatic filtering provided by modern e-mail software (rules to organize mail into folders, or spam filtering based on keywords):
  - Most users do not create such rules, as they find the software difficult to use or simply avoid customizing it.
  - Manually constructing robust rules is difficult, as users are constantly creating, deleting and reorganizing their folders.
  - The nature of the e-mails within a folder may drift over time, and the characteristics of spam (e.g. topics, frequent terms) also change over time, so the rules must be constantly re-tuned by the user, which is time-consuming and error-prone.

Results

Performance Measures
• Accuracy (A), recall (R), precision (P) and the F1 measure
• Stratified 10-fold cross-validation

Filing Into Folders

Overall Performance
• The simpler feature selector V is more effective than IG.
• U2 and U4 were harder to classify than U1, U3 and U5:
  - Different classification styles: U1, U3 and U5 file by topic and sender; U2 by the action performed (e.g. Read&Keep); U4 by topic, sender, action performed, and also by when e-mails need to be acted upon (e.g. ThisWeek).
  - Large ratio of the number of folders to the number of e-mails for U2 and U4.

Comparison with Other Classifiers

Effect of Normalization and Weighting
• Accuracy [%] for various normalization levels (e – e-mail, m – mailbox, g – global) and weightings (freq. – frequency, tf-idf and boolean).
• Best results: mailbox-level normalization with tf-idf or frequency weighting.

Spam Filtering

Cost-Sensitive Performance Measures
• Blocking a legitimate message is λ times more costly than not blocking a spam message.
• Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, this counts as λ errors/successes.

Overall Performance
• Performance on spam filtering for the lemm corpora.
• 3 scenarios: λ=1 – no cost (flagging spam e-mail); λ=9 – moderately accurate filter (notifying senders about blocked e-mails); λ=999 – highly accurate filter (completely automatic scenario).
• LingerIG – perfect results on both PU1 and LingSpam for all λ; LingerV outperformed only by stumps and boosted trees.

Effect of Stemming and Stop-Word List
• They do not help (performance on the 4 different versions of LingSpam).

Anti-Spam Filter Portability Across Corpora
[Figure: typical confusion matrices for cases a)–d)]
• a) and b) – low spam precision (SP); many false positives (legitimate e-mail classified as spam).
  - Reason: different nature of the legitimate e-mail in LingSpam (linguistics-related) and U5Spam (more diverse); the features selected from LingSpam are too specific and are not a good predictor for U5Spam (a).
• b) – low spam recall (SR) as well; many false negatives (spam classified as legitimate).
  - Reason: U5Spam is considerably smaller than LingSpam.
• c) and d) – good results: the feature selection is not perfect, but the NN is able to recover through training.
• More extensive experiments with diverse, non-topic-specific corpora are needed to determine the portability of anti-spam filters across different users.

LINGER Based on Text Categorization
• Bag-of-words representation – all unique words in the entire training corpus are identified.
• Feature selection chooses the most important words and reduces dimensionality – Information Gain (IG), Variance (V).
• Feature representation – a normalized weighting for every word, representing its importance in the document:
  - Weightings: binary, term frequency, term frequency-inverse document frequency (tf-idf)
  - Normalization at 3 levels: e-mail, mailbox, corpus
• Classifier – neural network (NN). Why?
  - Requires considerable time for parameter selection and training (-), but can achieve very accurate results and has been successfully applied in many real-world applications (+).
  - NN trained with backpropagation; 1 output for each class; 1 hidden layer of 20-40 neurons; early stopping based on a validation set or a maximum number of epochs (10,000).

Pre-processing for Word Extraction
• E-mail fields used: body, sender (From, Reply-To), recipient (To, CC, BCC) and Subject; attachments are treated as part of the body.
  - These fields are treated equally, and a single bag-of-words representation is created for each e-mail.
  - No stemming or stop-word removal is applied.
• Words that appear only once in a corpus are discarded.
• Words longer than 20 characters are removed from the body.
• The number of unique words in a corpus is reduced from about 9,000 to 1,000.

Corpora
• Filing into folders: the personal mailboxes of the five users U1-U5.
• Spam e-mail filtering: PU1, LingSpam and U5Spam; 4 versions of PU1 and LingSpam depending on whether stemming and a stop-word list were used (bare, lemm, stop and lemm_stop).
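The pre-processing steps above (a single bag of words per e-mail, discarding words that occur only once in the corpus, removing words longer than 20 characters) can be sketched as follows. The whitespace tokenizer and the function names are illustrative assumptions, not taken from the poster:

```python
from collections import Counter

def build_vocabulary(emails, max_len=20):
    """Bag-of-words vocabulary over a corpus of e-mails.

    Mirrors the pruning steps on the poster: words longer than
    max_len characters are dropped, and words that occur only once
    in the whole corpus are discarded. Whitespace tokenization is
    an assumption; the poster does not specify the tokenizer.
    """
    counts = Counter()
    for text in emails:
        counts.update(w for w in text.lower().split() if len(w) <= max_len)
    return sorted(w for w, c in counts.items() if c > 1)

def bag_of_words(text, vocab):
    """Term-frequency vector for one e-mail over the fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]
```

In LINGER the fields From, Reply-To, To, CC, BCC, Subject and the body would all be concatenated into the `text` passed to these functions, since the poster states the fields are treated equally.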
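The feature-representation stage can be illustrated with tf-idf. The poster only names the weightings and the three normalization levels; the exact idf formula and the Euclidean norm below are assumptions, shown here for the e-mail-level variant (the mailbox and corpus variants differ only in the document set the statistics are taken over):

```python
import math

def tf_idf(tf_matrix):
    """tf-idf weighting with e-mail-level length normalization.

    tf_matrix: list of term-frequency vectors, one per e-mail,
    all over the same vocabulary. Uses idf = log(N / df) and
    L2-normalizes each e-mail's vector (both are assumptions).
    """
    n_docs = len(tf_matrix)
    n_terms = len(tf_matrix[0])
    # document frequency of each term
    df = [sum(1 for row in tf_matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in tf_matrix:
        w = [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
             for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        weighted.append([x / norm for x in w] if norm else w)
    return weighted
```

Note that with this idf a term appearing in every e-mail gets weight 0, which is one reason frequency weighting can be competitive on small mailboxes.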
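The classifier described above (backpropagation, one hidden layer, one sigmoid output per class, early stopping on a validation set or a maximum number of epochs) might be sketched like this. The squared-error loss, the weight-initialization range and all hyperparameter defaults are assumptions for illustration, not values from the poster:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MLP:
    """One-hidden-layer network with one sigmoid output per class."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rnd = random.Random(seed)
        # +1 column in each layer for the bias weight
        self.w1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
             for row in self.w1]
        o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0])))
             for row in self.w2]
        return h, o

    def train_step(self, x, target, lr):
        """One online backpropagation update (squared-error loss)."""
        h, o = self.forward(x)
        d_out = [(t - y) * y * (1 - y) for t, y in zip(target, o)]
        # hidden deltas, backpropagated through the (pre-update) w2
        d_hid = [hj * (1 - hj) * sum(d * self.w2[k][j]
                                     for k, d in enumerate(d_out))
                 for j, hj in enumerate(h)]
        for k, d in enumerate(d_out):
            for j, hj in enumerate(h + [1.0]):
                self.w2[k][j] += lr * d * hj
        for j, d in enumerate(d_hid):
            for i, xi in enumerate(x + [1.0]):
                self.w1[j][i] += lr * d * xi

def fit(net, train, val, lr=0.5, max_epochs=10000, patience=20):
    """Train with early stopping when validation error stops improving.

    Restoring the best-so-far weights on stop is omitted for brevity.
    """
    best, bad = float("inf"), 0
    for _ in range(max_epochs):
        for x, t in train:
            net.train_step(x, t, lr)
        err = sum((ti - oi) ** 2
                  for x, t in val
                  for ti, oi in zip(t, net.forward(x)[1]))
        if err < best - 1e-9:
            best, bad = err, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return net
```

For folder filing, the e-mail is assigned to the folder whose output neuron fires most strongly; for spam filtering a single output with a threshold suffices.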
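The folder-filing results are reported with accuracy, recall, precision and the F1 measure. Computed per folder from the four binary counts, a minimal sketch (the function name is illustrative):

```python
def prf1(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary counts.

    tp/fp/fn/tn: e-mails of the folder classified into it, e-mails of
    other folders classified into it, e-mails of the folder missed,
    and e-mails of other folders correctly kept out.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Under stratified 10-fold cross-validation these counts are accumulated over the ten test folds, each of which preserves the folder proportions of the whole mailbox.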
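The cost-sensitive weighted accuracy used for spam filtering counts every legitimate message λ times when it is classified correctly or incorrectly. Assuming the standard definition implied by that description, a sketch (argument names are illustrative):

```python
def weighted_accuracy(lam, legit_ok, legit_err, spam_ok, spam_err):
    """Weighted accuracy: each legitimate e-mail counts lam times.

    legit_ok / legit_err: legitimate messages kept / wrongly blocked;
    spam_ok / spam_err: spam messages blocked / wrongly let through.
    """
    return (lam * legit_ok + spam_ok) / (
        lam * (legit_ok + legit_err) + spam_ok + spam_err)
```

With lam=1 this reduces to plain accuracy; lam=9 and lam=999 correspond to the moderately accurate and fully automatic scenarios on the poster, where blocking a legitimate e-mail is heavily penalized.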