Spam? Not any more! Detecting spam emails using neural networks
ECE/CS/ME 539 Project presentation
Submitted by Sivanadyan, Thiagarajan
Importance of the topic
• Spam is unsolicited, unwanted email
• It wastes bandwidth, storage space and, most of all, the recipient’s time

Goals of the Anti-spam Network
• Reliably block spam mails
• Should not block any non-spam mail, but may allow a few spam mails to slip through
• Adapt to the specific types of messages
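The second goal can be read as an asymmetric decision rule: blocking a good mail costs more than letting a spam mail through. The following is a minimal sketch of that idea, assuming the classifier outputs a spam score in [0, 1] and that a held-out set of scored mails is available. The function names, the `margin` parameter and the sample scores are hypothetical and do not come from the presentation.

```python
def pick_threshold(scores, labels, margin=0.01):
    """Return a threshold just above the highest-scoring non-spam mail.

    scores: classifier outputs in [0, 1], higher means more spam-like
    labels: 1 for spam, 0 for non-spam
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    # Block only mails that score above every non-spam mail in the sample.
    return min(1.0, max(ham_scores, default=0.0) + margin)


def block(score, threshold):
    return score >= threshold


# Hypothetical validation scores, for illustration only.
scores = [0.05, 0.20, 0.62, 0.55, 0.91, 0.97]
labels = [0, 0, 0, 1, 1, 1]

threshold = pick_threshold(scores, labels)
blocked = [block(s, threshold) for s in scores]
false_positives = sum(b and y == 0 for b, y in zip(blocked, labels))
missed_spam = sum((not b) and y == 1 for b, y in zip(blocked, labels))
```

With these made-up scores no good mail is blocked, at the price of one spam mail (score 0.55) slipping through, which is the trade-off the goals describe.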
Input Features – Data Set
• Original data set: 57 input attributes
• Output attribute: 1 (for spam), 0 (for non-spam)
• Inputs derived from email content
• Attributes indicate the frequency of specific words and characters
• Examples: ‘credit’, ‘free’ (in spam); ‘meeting’, ‘project’ (in non-spam)
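To make the frequency attributes concrete, here is a small sketch of how such inputs could be computed from an email body. It assumes each attribute is a percentage (occurrences of a word over all words, occurrences of a character over all characters); the tokenizer, the character list and the function name `extract_features` are illustrative assumptions, and only the four words are taken from the slide.

```python
import re

WORDS = ["credit", "free", "meeting", "project"]  # examples named on the slide
CHARS = ["!", "$"]  # hypothetical character attributes


def extract_features(text):
    """Map an email body to word and character frequencies, in percent."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty mails
    n_chars = max(len(text), 1)
    word_freq = [100.0 * tokens.count(w) / n_tokens for w in WORDS]
    char_freq = [100.0 * text.count(c) / n_chars for c in CHARS]
    return word_freq + char_freq


spam_like = extract_features("Free credit! Get your free credit report now!")
ham_like = extract_features("Project meeting moved to Friday at ten")
```

In the first message ‘credit’ and ‘free’ each make up 2 of 8 words, so both attributes are 25.0, while ‘meeting’ and ‘project’ are 0.0; the second message shows the opposite pattern.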
Preprocess the data
• Choose only the inputs whose values differ between spam and non-spam mails
• Two reduced data sets are obtained (21 inputs and 9 inputs)
• The data (4025 input vectors) is scaled to zero mean and unit variance
• Split the data into independent training and testing data sets
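The preprocessing steps above can be sketched as follows. The slides do not say how "differing" inputs were identified, so the ranking criterion here (gap between class means divided by the column's standard deviation) is an assumption, as are the function names, the 50/50 split and the toy data. The sketch follows the slide's order: standardize first, then split.

```python
import random
import statistics


def select_features(X, y, keep):
    """Rank columns by the gap between class means, scaled by the column's spread."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        spam = [v for v, label in zip(col, y) if label == 1]
        ham = [v for v, label in zip(col, y) if label == 0]
        spread = statistics.pstdev(col) or 1.0  # constant columns get score 0
        scores.append(abs(statistics.mean(spam) - statistics.mean(ham)) / spread)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def standardize(X):
    """Rescale every column to zero mean and unit variance."""
    cols = list(zip(*X))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def train_test_split(X, y, test_fraction=0.5, seed=0):
    """Shuffle once, then cut into disjoint training and testing sets."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * (1 - test_fraction))
    train, test = order[:cut], order[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


# Toy data: column 0 separates the classes, column 1 does not.
X = [[5.0, 1.0], [6.0, 2.0], [5.5, 1.5], [0.5, 1.0], [1.0, 2.0], [0.0, 1.5]]
y = [1, 1, 1, 0, 0, 0]

kept = select_features(X, y, keep=1)
X_reduced = standardize([[row[j] for j in kept] for row in X])
X_train, y_train, X_test, y_test = train_test_split(X_reduced, y)
```

On the toy data only column 0 is kept, because its class means differ while those of column 1 are identical.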
MLP Implementation
• Learning by the back-propagation algorithm
• Using the complete data set
  • Poor performance (classification rate: 63.2%)
  • Classified most of the mails as non-spam
• Using the reduced data set (21 inputs)
  • Good performance (classification rate: 93.8%)
  • All the non-spam is detected
  • Optimal MLP configuration: 20-10-10-10-7
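For readers unfamiliar with back propagation, the following is a minimal pure-Python sketch of an MLP with sigmoid units trained by stochastic gradient descent on squared error. It is not the project's implementation: the class `MLP`, the layer sizes `[2, 3, 1]`, the learning rate and the toy data are all illustrative assumptions and do not correspond to the configurations or results reported on the slide.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MLP:
    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # weights[l][j] holds the incoming weights of unit j in layer l;
        # the last entry of each unit is its bias weight.
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        activations = [list(x)]
        for layer in self.weights:
            prev = activations[-1] + [1.0]  # append constant bias input
            activations.append(
                [sigmoid(sum(w * a for w, a in zip(unit, prev))) for unit in layer]
            )
        return activations

    def train_step(self, x, target, lr):
        """One gradient step on squared error for a single example."""
        activations = self.forward(x)
        out = activations[-1]
        # Output-layer error term: (output - target) * sigmoid'(net)
        deltas = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
        for l in range(len(self.weights) - 1, -1, -1):
            prev = activations[l] + [1.0]
            if l > 0:
                # Propagate the error back through the weights before updating them.
                back = [
                    sum(self.weights[l][j][i] * deltas[j] for j in range(len(deltas)))
                    * activations[l][i] * (1.0 - activations[l][i])
                    for i in range(len(activations[l]))
                ]
            for j, unit in enumerate(self.weights[l]):
                for i in range(len(unit)):
                    unit[i] -= lr * deltas[j] * prev[i]
            if l > 0:
                deltas = back

    def predict(self, x):
        return self.forward(x)[-1][0]


def mean_squared_error(net, X, y):
    return sum((net.predict(x) - t) ** 2 for x, t in zip(X, y)) / len(X)


# Toy standardized data: label 1 (spam) when the first feature is positive.
X = [[1.2, -0.3], [0.8, 0.5], [1.5, 0.1], [-1.0, 0.4], [-0.7, -0.6], [-1.4, 0.2]]
y = [1, 1, 1, 0, 0, 0]

net = MLP([2, 3, 1])
loss_before = mean_squared_error(net, X, y)
for epoch in range(500):
    for x, t in zip(X, y):
        net.train_step(x, [t], lr=0.5)
loss_after = mean_squared_error(net, X, y)
predictions = [1 if net.predict(x) >= 0.5 else 0 for x in X]
```

Comparing `loss_before` with `loss_after` shows whether training reduced the error; the classification rate quoted on the slides would be the fraction of `predictions` that match the labels, measured on the testing set rather than on the training data as here.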
Cross Validation
• Using the reduced data set (9 inputs)
  • Good performance (classification rate: 92.1%)
  • Nearly all the non-spam is detected
  • Optimal MLP configuration: 20-10-10-8
• Using cross-validation
  • Negligible improvement in performance
  • Since all the data is derived from the same source, cross-validation offers no advantage
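The slides do not state which cross-validation scheme was used. As one common possibility, k-fold cross-validation can be sketched as below: the data is cut into k disjoint folds, each fold serves once as the testing set, and the classification rates are averaged. The stand-in classifier (a threshold on the first feature) is hypothetical and only keeps the example self-contained; it is not the MLP from the slides.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k disjoint folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(X, y, k, fit, predict):
    """Average classification rate over k folds."""
    rates = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        rates.append(correct / len(test))
    return sum(rates) / k


# Stand-in classifier (hypothetical): threshold the first feature
# halfway between the two class means of the training fold.
def fit(X, y):
    spam = [x[0] for x, t in zip(X, y) if t == 1]
    ham = [x[0] for x, t in zip(X, y) if t == 0]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2


def predict(threshold, x):
    return 1 if x[0] > threshold else 0


X = [[1.2], [0.8], [1.5], [0.9], [-1.0], [-0.7], [-1.4], [-0.9]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
rate = cross_validate(X, y, k=4, fit=fit, predict=predict)
```

Because every sample is tested exactly once, the averaged rate uses all the data for evaluation, which matters most when the data set is small.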
Inference of the results
• A larger number of inputs does not necessarily improve performance
• It is important to remove redundant and irrelevant features
• There is no single optimum MLP configuration for all inputs; it needs to be adapted to the email content
• Neural networks can be combined with other types of spam filters
Conclusion
• Neural networks are a viable option in spam filtering
• A number of heuristic methods are being increasingly applied in this field
• Need to exploit the differences between spam and ‘good’ emails
• Further opportunities
  • Data sets from different sources need to be used for training
  • Fuzzy logic and combinational algorithms can be used in this application