PPM-based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com
College of Computer Science, Zhejiang University
April 10, 2008
Outline
• PPM (prediction by partial matching)
• Email Pre-processing
• Train PPM Model
• Model Classification
PPM Data Compression
Email Pre-processing
• Restrict the text to a source alphabet
• Merge consecutive spaces
• Truncate long messages
Email Pre-processing: Sample
• Alphabet: {a, b, c, d, e, f, _, =, space}
• Replace char: ?
• Truncate length: 20
Raw data:          Abcd_= - Af?/[]=+ safj =ab fe addfe
After replace:     abcd_= ? Af????=? ?af? =ab fe addfe
After merge blank: abcd_= ? Af????=? ?af? =ab fe addfe
After truncate:    abcd_= ? Af????=? ?a
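The three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `preprocess` and its parameters are hypothetical, and the input string is a made-up example with runs of spaces rather than the exact sample from the slide.

```python
import re


def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical helper: replace, merge blanks, truncate (in that order)."""
    # 1. Replace every character outside the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # 2. Merge each run of consecutive spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # 3. Keep only the first max_len characters of long messages.
    return merged[:max_len]


ALPHABET = set("abcdef_= ")  # assumed alphabet, mirroring the slide's sample
raw = "abcd_=  -  af?/[]=+   safj =ab fe addfe"
print(preprocess(raw, ALPHABET))  # abcd_= ? af????=? ?a
```

Whether upper-case letters are folded to lower case before the alphabet check is not stated on the slide, so the sketch leaves case untouched.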
Train PPM Model
• Use an order-6 PPM* model
• Use Method D escape estimation
• Train two PPM models
  • HAM model
  • SPAM model
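To make the training step concrete, here is a simplified sketch of a bounded-order PPM model with Method D escape estimation. It rests on several assumptions: the class name `PPMModel` and its methods are hypothetical, the model uses a fixed maximum order rather than the unbounded contexts of PPM*, and symbol exclusion is omitted, so the probabilities sum to slightly less than one. Method D assigns a symbol seen c times in a context with n total occurrences and d distinct symbols the probability (2c - 1) / (2n), and the escape probability d / (2n).

```python
import math
from collections import defaultdict


class PPMModel:
    """Simplified bounded-order PPM with Method D escapes (no exclusion)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # used for the order -1 fallback
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        # Count each symbol under every context of length 0..order.
        for i, sym in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def prob(self, history, sym):
        # Start at the longest usable context and escape to shorter ones.
        p_escape = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue  # context never seen: drop to a shorter one
            n, d = sum(table.values()), len(table)
            c = table.get(sym, 0)
            if c:
                return p_escape * (2 * c - 1) / (2 * n)
            p_escape *= d / (2 * n)
        # Order -1: uniform distribution over the alphabet.
        return p_escape / self.alphabet_size

    def bits_per_char(self, text):
        # Cross-entropy of the text under this (static) model.
        if not text:
            return 0.0
        total = 0.0
        for i, sym in enumerate(text):
            total -= math.log2(self.prob(text[max(0, i - self.order):i], sym))
        return total / len(text)


# One model per class, as on the slide (the training strings are made up).
ham_model = PPMModel(order=6)
ham_model.train("meet me at noon for lunch")
spam_model = PPMModel(order=6)
spam_model.train("buy cheap pills now, click here")
```

A message is then scored by calling `bits_per_char` on each model; the model that compresses the message better yields the smaller value.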
Model Classification
• MCE (Minimum Cross-Entropy)
• MDL (Minimum Description Length)
• Spam Score
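The minimum cross-entropy rule can be illustrated as follows: score the message under each class model and pick the class whose model gives the lower cross-entropy. This sketch makes two loud assumptions. First, `ToyCharModel` is an order-0 character model with add-one smoothing, used only as a small stand-in for a trained PPM model. Second, the slide does not give its spam score formula, so the score below (the cross-entropy difference) is a hypothetical choice. The MDL variant, which updates the model while scoring the message, is not shown.

```python
import math
from collections import Counter


class ToyCharModel:
    """Order-0 character model with add-one smoothing (stand-in, NOT PPM)."""

    def __init__(self, training_text, alphabet_size=256):
        self.counts = Counter(training_text)
        self.total = len(training_text)
        self.alphabet_size = alphabet_size

    def bits(self, message):
        # Total code length of the message in bits under this model.
        denom = self.total + self.alphabet_size
        return sum(-math.log2((self.counts[ch] + 1) / denom) for ch in message)


def cross_entropy(model, message):
    # Average bits per character; guard against empty messages.
    return model.bits(message) / max(len(message), 1)


def classify_mce(message, ham_model, spam_model):
    h_ham = cross_entropy(ham_model, message)
    h_spam = cross_entropy(spam_model, message)
    # Ties go to ham, which errs on the side of fewer false positives.
    label = "spam" if h_spam < h_ham else "ham"
    # Hypothetical spam score: positive means the spam model fits better.
    score = h_ham - h_spam
    return label, score


ham = ToyCharModel("aaaa bbbb")
spam = ToyCharModel("zzzz yyyy")
print(classify_mce("zzyy", ham, spam)[0])  # spam
print(classify_mce("aabb", ham, spam)[0])  # ham
```

Thresholding the score at a value other than zero would trade false positives against false negatives.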
Advantages
• Simple pre-processing
• No decoding needed (robust to obfuscation)
• Highly self-adaptive
• Low false-positive rate
References
• "Spam Filtering Using Statistical Data Compression Models"
• "Unbounded Length Contexts for PPM"
Questions
• Delay Index
• ham, Ham and HAM
• Active learning 10000
• Deliver the filter