SVM based Spam Filtering in SEWM2007
Pan Weike, Lu Guanzhong, Xu Congfu
panweike@zju.edu.cn, oillgz@gmail.com, xucongfu@zju.edu.cn
College of Computer Science, Zhejiang University
March 11, 2007
Outline
• Email Pre-processing
• Feature Extraction
• Support Vector Regression
Email Pre-processing
• Some problems:
  • An email may contain more than two charsets.
  • The charset information of some emails is missing.
• An efficient approach is needed to obtain accurate charset information for each email.
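One way to cope with mixed or missing charsets can be sketched as follows. This is a minimal illustration using Python's standard `email` package, not the authors' method: each text part is decoded with its own declared charset, and a fallback list is tried when the declaration is missing or wrong. The names `decode_bytes`, `extract_text_parts`, and `FALLBACK_CHARSETS`, and the fallback order itself, are assumptions made for this example.

```python
import email
from typing import List, Optional, Tuple

# Hypothetical fallback order for Chinese mail; an assumption, not from the slides.
FALLBACK_CHARSETS = ("utf-8", "gb18030", "big5")


def decode_bytes(raw: bytes, declared: Optional[str]) -> Tuple[str, str]:
    """Return (text, charset_used), trying the declared charset first."""
    candidates = ([declared] if declared else []) + list(FALLBACK_CHARSETS)
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next one.
            continue
    # Last resort: never fail, replace undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def extract_text_parts(raw_message: bytes) -> List[Tuple[str, str]]:
    """Decode every text/* part of a raw email independently."""
    msg = email.message_from_bytes(raw_message)
    results = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        results.append(decode_bytes(payload, part.get_content_charset()))
    return results
```

Because each MIME part is handled separately, a message whose subject, plain-text body, and HTML body use different charsets is still decoded part by part. Trial decoding is only a heuristic: a permissive codec such as gb18030 can accept bytes that were really written in another encoding, so a statistical detector may be needed for accurate results.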
Feature Extraction
• Tokenization: Tianwang Chinese word segmentation algorithm
  • http://net.pku.edu.cn/~webg/src/ChSeg/
• No feature selection (e.g. TF, CHI, IG) is applied.
• VSM: TF*IDF weighting, with subject:body = 3:1
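The TF*IDF weighting with a 3:1 subject-to-body ratio could look like the sketch below. It assumes the ratio means that each subject token counts three times as much as a body token in the term frequency, and that the IDF is the plain log(N/df) variant; the slides specify neither. Tokens are assumed to be already segmented, and the names `weighted_tf` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter

# Assumed reading of "subject:body = 3:1": per-token weights in the term frequency.
SUBJECT_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def weighted_tf(subject_tokens, body_tokens):
    """Term frequency in which subject tokens count three times as much."""
    tf = Counter()
    for token in subject_tokens:
        tf[token] += SUBJECT_WEIGHT
    for token in body_tokens:
        tf[token] += BODY_WEIGHT
    return tf


def tfidf_vectors(docs):
    """docs: list of (subject_tokens, body_tokens); returns one sparse dict per email."""
    tfs = [weighted_tf(subject, body) for subject, body in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each email contributes at most 1 per term
    n = len(docs)
    return [
        {term: weight * math.log(n / df[term]) for term, weight in tf.items()}
        for tf in tfs
    ]
```

With no feature selection, every token seen in the corpus becomes a dimension, so sparse dictionaries keep the vectors manageable.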
Support Vector Regression
• SVR toolbox: LIBSVM
  • http://www.csie.ntu.edu.tw/~cjlin/libsvm/
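LIBSVM's command-line tools read training data in a sparse text format, one example per line: a target value followed by `index:value` pairs in ascending index order. The helper below is a small sketch of writing feature vectors in that format; the function name `to_libsvm_line`, the mapping from terms to 1-based indices, and the choice of +1 and -1 as regression targets for spam and non-spam are assumptions for this example, not details given in the slides.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    features: dict mapping a 1-based feature index to its weight.
    Zero-valued features are skipped because the format is sparse.
    """
    parts = [f"{label:g}"]
    for index in sorted(features):  # indices must appear in ascending order
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


def write_libsvm_file(path, examples):
    """examples: iterable of (label, features) pairs."""
    with open(path, "w", encoding="ascii") as handle:
        for label, features in examples:
            handle.write(to_libsvm_line(label, features) + "\n")
```

A file written this way can be passed to LIBSVM's `svm-train`, where `-s 3` selects epsilon-SVR; the kernel and other parameters used in the original system are not stated in the slides.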