
PPM based Spam Filtering in SEWM2008

Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong (llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com). College of Computer Science, Zhejiang University. April 10, 2008.


Presentation Transcript


  1. PPM based Spam Filtering in SEWM2008 • Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong • llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com • College of Computer Science, Zhejiang University • April 10, 2008

  2. Outline • PPM (prediction by partial matching) • Email Pre-processing • Train PPM Model • Model Classification

  3. PPM Data Compression

  4. PPM Framework

  5. Email Pre-processing • Restrict text to the source alphabet • Merge consecutive spaces • Truncate long messages

  6. Email Pre-processing: Sample
     Alphabet: {a,b,c,d,e,f,_,=, }
     Replace char: ?
     Truncate length: 20
     Raw data: Abcd_= - Af?/[]=+ safj =ab fe addfe
     After replace: abcd_= ? Af????=? ?af? =ab fe addfe
     After merging blanks: abcd_= ? Af????=? ?af? =ab fe addfe
     After truncation: abcd_= ? Af????=? ?a
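The three pre-processing steps on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `preprocess` and its parameters are hypothetical, the input string is made up for the example (it is all lowercase, since the slide does not say how uppercase letters are treated), and only runs of the space character are merged.

```python
import re

def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Sketch of the slide's pipeline: replace, merge blanks, truncate."""
    # Step 1: every character outside the source alphabet becomes replace_char.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # Step 2: collapse each run of two or more spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # Step 3: keep only the first max_len characters.
    return merged[:max_len]

# Alphabet from the slide: a-f, underscore, equals sign, and space.
ALPHABET = set("abcdef_= ")

# Hypothetical raw input (note the double spaces).
raw = "abcd_=  -  af?/[]=+  safj =ab fe addfe"
print(preprocess(raw, ALPHABET))  # abcd_= ? af????=? ?a
```

Truncating after the merge step means the length limit counts characters of the cleaned message, which matches the order of the steps on the slide.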

  7. Train PPM Model • Use an order-6 PPM* model • Use Method D escape estimation • Train two PPM models: • HAM model • SPAM model
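To make the training step concrete, here is a minimal character-level sketch with Method D escape estimation: in a context seen n times with t distinct successors, a symbol seen c times gets probability (2c - 1) / (2n) and the escape gets t / (2n). It is not the authors' implementation. The class name `PPMModel` is hypothetical, contexts are bounded at a fixed order rather than unbounded as in PPM*, exclusion is omitted (so the probabilities sum to less than one), and the final fallback is uniform over an assumed 256-symbol alphabet. The training strings are invented.

```python
from collections import defaultdict

class PPMModel:
    """Simplified order-k PPM with Method D escapes (no exclusion)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the uniform fallback
        # counts[context][symbol] = how often symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            # Update every context of length 0..order that ends just before sym.
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def prob(self, history, sym):
        p_escape = 1.0
        # Try the longest available context first, then back off.
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue  # context never seen: back off without an escape cost
            n = sum(table.values())
            if sym in table:
                return p_escape * (2 * table[sym] - 1) / (2 * n)  # Method D
            p_escape *= len(table) / (2 * n)  # Method D escape probability
        return p_escape / self.alphabet_size  # uniform fallback

# One model per class, as on the slide (toy training data).
ham_model = PPMModel(order=6)
ham_model.train("meeting at noon tomorrow")
spam_model = PPMModel(order=6)
spam_model.train("buy cheap pills now")
```

Keeping two independent models is what lets a message later be scored against each class separately.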

  8. Model Classification • MCE (Minimum Cross-entropy) • MDL (Minimum Description Length) • Spam Score
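A minimum cross-entropy decision can be sketched as follows: compute the average number of bits per character each class model needs for the message, and pick the class whose model needs fewer. Everything named here is an assumption for illustration. `UnigramModel` is a deliberately simple stand-in for a trained PPM model (any object with a `prob(history, sym)` method would do), and defining the spam score as the difference of the two cross-entropies is one plausible choice, not necessarily the formula used in the talk. MDL, which scores with a model that keeps adapting while it reads the message, is not shown.

```python
import math
from collections import Counter

class UnigramModel:
    """Stand-in for a trained PPM model: add-one smoothed character frequencies."""

    def __init__(self, training_text, alphabet_size=256):
        self.counts = Counter(training_text)
        self.total = len(training_text)
        self.alphabet_size = alphabet_size

    def prob(self, history, sym):
        # history is ignored here; a PPM model would condition on it.
        return (self.counts[sym] + 1) / (self.total + self.alphabet_size)

def cross_entropy(model, text):
    """Average bits per character needed to encode text with a static model."""
    if not text:
        return 0.0
    bits = -sum(math.log2(model.prob(text[:i], text[i])) for i in range(len(text)))
    return bits / len(text)

def spam_score(ham_model, spam_model, text):
    # Positive when the spam model compresses the text better than the ham model.
    return cross_entropy(ham_model, text) - cross_entropy(spam_model, text)

def classify(ham_model, spam_model, text):
    return "spam" if spam_score(ham_model, spam_model, text) > 0 else "ham"

# Toy models trained on invented text.
ham = UnigramModel("meeting at noon tomorrow")
spam = UnigramModel("buy cheap pills now")
print(classify(ham, spam, "cheap pills"))      # spam
print(classify(ham, spam, "meeting at noon"))  # ham
```

A real-valued score, rather than a hard label, also makes it possible to move the decision threshold, for example to trade a few missed spam messages for fewer false positives.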

  9. Advantages • Simple pre-processing • No decoding needed (robust to obfuscation) • Highly self-adaptive • Low false-positive rate

  10. References • "Spam Filtering Using Statistical Data Compression Models" • "Unbounded Length Contexts for PPM"

  11. Question • Delay Index • ham, Ham and HAM • Active learning 10000 • Deliver the filter

  12. Thanks for your attention! Q&A
