Data Mining & Machine Learning Final Project

Data Mining & Machine Learning Final Project • Group 2 • R95922027 李庭閣 • R95922034 孔垂玖 • R95922081 許守傑 • R95942129 鄭力維

Outline • Experiment setting • Feature extraction • Model training • Hybrid-Model • Conclusion • Reference

Experiment setting • Selected online corpus: Enron • HTML tags removed • Important headers factored out • Six folders, enron1 to enron6 • 13,496 spam mails and 15,045 ham mails in total
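
The HTML-removal step could look roughly like the sketch below. It uses only the Python standard library; the function name `strip_html` is a hypothetical helper, and the project's actual preprocessing tool is not stated on the slide.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and drop every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Return the text content of `text` with tags removed (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join(" ".join(parser.parts).split())
```

Whitespace is normalized after stripping so that later tokenization sees one space between words.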

Feature Extraction • Transmitted time of the mail • Number of receivers • Existence of attachments • Existence of images in the mail • Existence of cited URLs in the mail • Symbols in the mail title • Mail body
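
As a rough illustration of the header-level features listed above, the following sketch parses one raw message with Python's standard `email` package. The function name, the dictionary keys, the URL pattern, and the rule that a part with a filename counts as an attachment are all assumptions made for this example, not the project's code.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Assumed URL pattern; the slides do not say how URLs were detected.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_features(raw: str) -> dict:
    """Extract simple non-textual features from one raw mail (illustrative only)."""
    msg = message_from_string(raw)

    # Transmitted time: keep only the hour of the Date header.
    date_header = msg.get("Date")
    try:
        hour = parsedate_to_datetime(date_header).hour if date_header else None
    except (TypeError, ValueError):
        hour = None

    # Number of receivers: addresses in To and Cc.
    receivers = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

    has_attachment = False
    has_image = False
    text_parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_filename():  # assumption: a named part is an attachment
            has_attachment = True
        if part.get_content_maintype() == "image":
            has_image = True
        elif part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                text_parts.append(payload)
    body = "\n".join(text_parts)

    return {
        "hour": hour,
        "n_receivers": len(receivers),
        "has_attachment": has_attachment,
        "has_image": has_image or "<img" in body.lower(),
        "has_url": bool(URL_RE.search(body)),
    }
```

Each feature is kept deliberately coarse (an hour, a count, three booleans) so that it can be tabulated against the spam/ham label.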

Transmitted Time of the Mail & Number of Receivers • Transmitted time: spam follows a non-uniform distribution • Number of receivers: spam has only a single receiver

Probability of being Spam for Transmitted Time & Receiver Size
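
One simple way to obtain such a probability is the empirical spam rate per bucket, where a bucket is an hour of the day or a receiver count. The sketch below assumes that estimator; the slide does not state how the plotted probabilities were computed, and `spam_rate_by_bucket` is a hypothetical name.

```python
from collections import Counter


def spam_rate_by_bucket(samples):
    """Estimate P(spam | bucket) as (#spam in bucket) / (#mails in bucket).

    `samples` is an iterable of (bucket, is_spam) pairs.
    """
    total = Counter()
    spam = Counter()
    for bucket, is_spam in samples:
        total[bucket] += 1
        if is_spam:
            spam[bucket] += 1
    return {bucket: spam[bucket] / total[bucket] for bucket in total}
```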

Attachment, Images, and URL

Symbols in Mail Titles • Title absence: spam senders now add titles. • Arabic numerals: almost equal probability (date, ID) • Non-alphanumeric characters & punctuation marks: appear more often in spam / appear more often in ham
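
The title features above can be sketched as a few counts over the subject string. The feature names and the definition of a "symbol" (any character that is neither alphanumeric nor whitespace) are assumptions for illustration.

```python
def title_features(subject):
    """Return simple symbol features of a mail title (illustrative only)."""
    title = (subject or "").strip()
    return {
        # True when the mail has no title at all.
        "title_absent": title == "",
        # True when the title contains at least one digit.
        "has_digit": any(ch.isdigit() for ch in title),
        # Count of characters that are neither letters, digits, nor spaces.
        "n_symbols": sum(1 for ch in title if not ch.isalnum() and not ch.isspace()),
    }
```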

Mail body • Build the internal structure of words • Use the NLP tool TreeTagger for word stemming • From the stemmed words appearing in each mail, build a sparse-format vector representing the “semantics” of the mail
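
A sparse vector of this kind can be sketched as a sorted list of (word index, count) pairs. The example assumes the words have already been stemmed (TreeTagger is not called here), and `build_vocab` and `to_sparse` are hypothetical helper names.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct stemmed word an integer index in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse(tokens, vocab):
    """Return [(index, count), ...] sorted by index; out-of-vocabulary words are ignored."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())
```

Only non-zero entries are stored, which matters because a single mail uses a tiny fraction of the corpus vocabulary.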

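Naïve Bayes • Given a bag of words (x1, x2, x3, …, xn), Naïve Bayes assumes the words are conditionally independent given the class; despite this simplification, it is powerful for document classification.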
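
A minimal sketch of a bag-of-words Naïve Bayes classifier follows. It assumes the multinomial variant with add-one (Laplace) smoothing, which the slide does not specify; the class and method names are illustrative.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed variant)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        label_counts = Counter(labels)
        self.log_prior = {c: math.log(label_counts[c] / len(labels)) for c in self.classes}
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_score(self, tokens, c):
        """log P(c) + sum_i log P(x_i | c); words unseen in training are skipped."""
        v = len(self.vocab)
        score = self.log_prior[c]
        for token in tokens:
            if token in self.vocab:
                score += math.log((self.word_counts[c][token] + 1) / (self.total[c] + v))
        return score

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_score(tokens, c))
```

Scores are summed in log space to avoid numerical underflow on long mails.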

Vector Space Model • Create a word-document (mail) matrix with SRILM. • For every pair of mails (columns), a similarity value can be calculated.
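
The slide does not name the similarity measure. Cosine similarity is a common choice in vector space models, so the sketch below assumes it, with each mail represented as a word-to-weight dictionary.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: an empty mail is similar to nothing
    return dot / (norm_a * norm_b)
```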

KNN (Vector Space Model) • With K = 1, the KNN classification model shows the best accuracy.
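
A K-nearest-neighbour classifier over any similarity function can be sketched as below. The function names and the toy `overlap` similarity are assumptions for illustration; with `k=1` the prediction is simply the label of the most similar training mail.

```python
from collections import Counter


def overlap(a, b):
    """Toy similarity: number of shared words between two word sets."""
    return len(set(a) & set(b))


def knn_predict(query, train, sim=overlap, k=1):
    """Label `query` by majority vote among its k most similar training items.

    `train` is a list of (vector, label) pairs; `sim` returns larger values
    for more similar vectors.
    """
    ranked = sorted(train, key=lambda item: sim(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```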

Maximum Entropy • Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution. • The elements of the word-document matrix are converted to binary values {0, 1}.
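
For reference, the standard conditional maximum entropy model has the form below. The symbols are not from the slides: $f_i(x, y)$ denotes a binary feature (here, whether word $i$ appears in mail $x$ labeled $y$), $\lambda_i$ its weight, and $\tilde{p}$ the empirical distribution of the training data.

```latex
p_\lambda(y \mid x) = \frac{\exp\!\left(\sum_i \lambda_i f_i(x, y)\right)}
                           {\sum_{y'} \exp\!\left(\sum_i \lambda_i f_i(x, y')\right)},
\qquad
\text{subject to } \mathbb{E}_{p_\lambda}[f_i] = \mathbb{E}_{\tilde{p}}[f_i] \text{ for all } i .
```

Among all distributions satisfying these constraints, the one with maximum entropy takes this exponential form, and fitting $\lambda$ by maximum likelihood is equivalent to minimizing the Kullback-Leibler divergence from $\tilde{p}$ to $p_\lambda$.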

SVM • Binary: a value in {0, 1} indicates whether the word appears. • Normalized: count the occurrences of each word and divide by its maximum occurrence count.
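
The two feature encodings can be sketched as follows. The example assumes that "maximum occurrence count" means the largest count of that word in any single mail of the corpus; the slide does not define it more precisely, and all function names are hypothetical.

```python
def binary_features(counts):
    """Map {word: count} to {word: 1} for every word that appears."""
    return {word: 1 for word, c in counts.items() if c > 0}


def max_counts(corpus):
    """For each word, the largest count observed in any single mail (assumed meaning)."""
    maxima = {}
    for counts in corpus:
        for word, c in counts.items():
            maxima[word] = max(maxima.get(word, 0), c)
    return maxima


def normalized_features(counts, maxima):
    """Scale each word count into (0, 1] by dividing by that word's maximum count."""
    return {
        word: c / maxima[word]
        for word, c in counts.items()
        if c > 0 and maxima.get(word, 0) > 0
    }
```

Both encodings keep every feature in the range [0, 1], which suits SVM training better than raw counts.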

Single-layer Perceptron Hybrid Model • The accuracy of the NN-based hybrid model is always the highest.
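
One way such a hybrid could work is to feed the base classifiers' decisions into a single-layer perceptron that learns a weight per classifier. The sketch below assumes each base output is encoded as +1 (spam) or -1 (ham) and uses the classic perceptron update rule; the slides do not describe the actual inputs or training procedure.

```python
def train_perceptron(X, y, epochs=20, lr=1.0):
    """Train a single-layer perceptron.

    X: list of rows, each row holding the base classifiers' outputs in {-1, +1}.
    y: list of true labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * activation <= 0:  # misclassified (or on the boundary): update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def perceptron_predict(w, b, x):
    """Return +1 (spam) or -1 (ham) for one row of base-classifier outputs."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1
```

Unlike plain voting, the learned weights let a reliable base classifier outweigh two unreliable ones.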

Committee-based Hybrid Model • The voting model averages the classification results, slightly improving the filter. However, voting can sometimes reduce accuracy when the majority misjudges. • KNN + Naïve Bayes + Maximum Entropy • Naïve Bayes + Maximum Entropy + SVM
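
A committee of three classifiers can be sketched as a plain majority vote. The function name is hypothetical, and equal weighting of the members is an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most committee members."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `majority_vote(["spam", "ham", "spam"])` yields `"spam"`. The same rule also shows the failure case mentioned above: if two of the three members misjudge a mail, the committee is wrong even though one member was right.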

Conclusion • 7 features are shown to discriminate mail type. • Transmitted time & receiver size • Attachment, image, and URL • Non-alphanumeric characters & punctuation marks • 5 popular machine learning methods are shown to be suitable for spam filtering. • Naïve Bayes, KNN, SVM • 2 model combination methods are tested. • Committee-based & single neural network

References • [1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998. • [2] A Plan for Spam: http://www.paulgraham.com/spam.html • [3] Enron Corpus: http://www.aueb.gr/users/ion/ • [4] TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html • [5] Maximum Entropy: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html • [6] SRILM: http://www.speech.sri.com/projects/srilm/ • [7] SVM: http://svmlight.joachims.org/
