A Unified Model of Spam Filtration

William Yerazunis1, Shalendra Chhabra2, Christian Siefkes3, Fidelis Assis4, Dimitrios Gunopulos2. 1Mitsubishi Electric Research Labs, Cambridge, MA; 2University of California Riverside, CA; 3GKVI / FU Berlin, Germany; 4Embratel, Rio de Janeiro, Brazil.


Presentation Transcript


1. A Unified Model of Spam Filtration
William Yerazunis1, Shalendra Chhabra2, Christian Siefkes3, Fidelis Assis4, Dimitrios Gunopulos2
1Mitsubishi Electric Research Labs, Cambridge, MA; 2University of California Riverside, CA; 3GKVI / FU Berlin, Germany; 4Embratel, Rio de Janeiro, Brazil

2. “Spam is a Nuisance”: Spam Statistics

3. The Problem of Spam Filtering: Given a large set of email readers (who wish to communicate reciprocally with each other without prearrangement) and another set of email spammers (who wish to communicate with the readers, while the readers do not wish to communicate with the spammers), the set of filtering actions is the set of steps readers take to “maximize their desired communications and minimize their undesired communications”.

4. Motivation
• All current spam filters have very similar performance – why?
• Can we learn anything about current filters by modeling them?
• Does the model point out any obvious flaws in our current designs?

5. What do you MEAN, “Similar Performance”?
• Gordon Cormack's “Mr. X” test suite – all filters (heuristic, statistical) get about 99% accuracy.
• Large variance of reported real-user results for all filters – everything from “nearly perfect” to “unusable” on the same software.
• Conclusion: This problem is evil.

6. Method
• Examine a large number of spam filters
• Think Real Hard (the Feynman method of deduction)
• Come up with a unified model for spam filters
• Examine the model for weaknesses and scalability
• Consider remedies for the weaknesses

7. The Model
• Models information flow
• Considers stateful vs. stateless computations and computational complexity
• Simplification: training is usually done “offline”, so we assume a stationary filter configuration (no learning during a filtering run) for the computational model.

8. Empirical Stages in Filtering Pipeline
• Preprocessing – text-to-text transform
• Tokenization
• Feature Generation
• Statistics Table Lookup
• Mathematical Combination
• Thresholding

9. Empirical Stages in Filtering Pipeline
• Preprocessing – text-to-text transform
• Tokenization
• Feature Generation
• Statistics Table Lookup
• Mathematical Combination
• Thresholding
Everybody does it this way!

10. Empirical Stages in Filtering Pipeline
• Preprocessing – text-to-text transform
• Tokenization
• Feature Generation
• Statistics Table Lookup
• Mathematical Combination
• Thresholding
Everybody does it this way! Including SpamAssassin.

11. The Obligatory Flowchart
• Note that this is now a pipeline – anyone for a scalable solution?

12. Step 1: Preprocessing: Arbitrary Text-to-Text Transformation
• Character Set Folding / Case Folding
• Stopword Removal
• MIME Normalization / Base64 Decoding
• HTML Decommenting (“hypertextus interruptus”)
• Heuristic Tagging (e.g. “FORGED_OUTLOOK_TAGS”)
• Identifying Lookalike Transformations: ‘@’ instead of ‘a’, ‘$’ instead of ‘S’

13. Step 2: Tokenization
• Converting incoming text (and heuristic tags) into features
• Two-step process:
• Use a regular expression (regex) to segment the incoming text into tokens.
• Then convert this token stream T into features. This conversion is not necessarily 1:1 (a minimal tokenizer sketch follows below).
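As an illustration, a minimal regex tokenizer in Python. The pattern here is an assumption chosen for simplicity; the talk does not name the exact regex any particular filter uses:

```python
import re

# Illustrative tokenizer: any run of non-whitespace characters is a token.
# Real filters use more elaborate patterns; this regex is an assumption.
TOKEN_REGEX = re.compile(r"\S+")

def tokenize(text: str) -> list[str]:
    """Segment incoming text into a stream of string tokens."""
    return TOKEN_REGEX.findall(text)

print(tokenize("Click HERE for fr33 stuff!"))
# ['Click', 'HERE', 'for', 'fr33', 'stuff!']
```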

14. Step 3: Converting Tokens into Features
• A “Token” is a piece of text that matches the tokenizing regex.
• A “Feature” is a mostly-unique identifier that the filter's trained database can convert into a mathematically manipulable value.

15. Converting Tokens into Features, first part: convert Text to an Arbitrary Integer
• For each unknown text token, convert that text into a (preferably unique) arbitrary integer.
• This conversion can be done by dictionary lookup (i.e. “foo” is the 82,934th word in the dictionary) or by hashing.
• The output of this stage is a stream of unique integer IDs. Call this row vector T. (A hashing sketch follows below.)
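A sketch of the hashing route, with CRC32 standing in for whatever hash function a real filter actually uses (that choice is an assumption for illustration):

```python
import zlib

def token_to_id(token: str) -> int:
    """Map a token string to a (preferably unique) integer ID by hashing.
    CRC32 is a stand-in for a real filter's hash; collisions are
    possible but rare enough for illustration."""
    return zlib.crc32(token.encode("utf-8"))

# The token stream from the tokenization step becomes the row vector T:
T = [token_to_id(tok) for tok in ["Click", "HERE", "for", "fr33", "stuff!"]]
print(T)
```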

16. Second part: Matrix-based Feature Generation
• Use a rectangular feature generation profile matrix P; each column of P is a vector containing small primes (or zero).
• The dot product of each column of P against a segment of the token stream T is a unique encoding of a segment of the feature stream at that position.
• The number of features out per element of T is equal to the column rank of P (i.e. the number of columns).
• Zero elements in a P column indicate that the corresponding token is disregarded in this particular feature.

17. Matrix-based Feature Generation, Example: Unigram + Bigram Features
• A P matrix for unigram plus bigram features is:
P = | 1 2 |
    | 0 3 |
• If the incoming token stream T is 10, 20, 30, ... then successive dot products of T against the columns of P yield:
10 * 1 + 20 * 0 = 10 <-- first position, first column
10 * 2 + 20 * 3 = 80 <-- first position, second column
20 * 1 + 30 * 0 = 20 <-- second position, first column
20 * 2 + 30 * 3 = 130 <-- second position, second column
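A runnable sketch of this sliding-window dot product, reproducing the worked example above (pure Python for clarity; a real filter would do this incrementally):

```python
# Slide the token stream T over the rows of P, dotting each column of P
# against each window of T.
P = [[1, 2],
     [0, 3]]         # rows = positions in the window, columns = feature profiles
T = [10, 20, 30]     # integer token IDs from the previous step

window = len(P)      # window length = number of rows in P
features = []
for pos in range(len(T) - window + 1):
    for col in range(len(P[0])):
        features.append(sum(P[row][col] * T[pos + row]
                            for row in range(window)))

print(features)      # [10, 80, 20, 130]
```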

18. Matrix-based Feature Generation, Nice Side Effects
• Zeroes in a P matrix indicate “disregard this location in the token stream”.
• Identical primes in a column of a P matrix allow order-invariant feature outputs, e.g.:
P = | 1 2 |
    | 0 2 |
generates the unigram features, AND the set of bigrams in an order-invariant style (i.e. “foo bar” == “bar foo”).

19. Matrix-based Feature Generation, Nice Side Effects II
• Identical primes in different columns of a P matrix allow order-sensitive, distance-invariant feature outputs, e.g.:
P = | 1 2 2 |
    | 0 3 0 |
    | 0 0 3 |
generates the unigram features and the set of bigrams in an order-sensitive, distance-invariant style (i.e. “foo bar” == “foo <skip> bar”). (Both side effects are checked in the sketch below.)
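Both side effects can be verified with a small sketch; the helper generalizes the earlier example, and the token values are arbitrary stand-ins:

```python
def gen_features(T, P):
    """Dot each column of P against each window of the token stream T.
    Zero entries in a column simply drop that position from the feature."""
    window, ncols = len(P), len(P[0])
    return [sum(P[r][c] * T[pos + r] for r in range(window))
            for pos in range(len(T) - window + 1)
            for c in range(ncols)]

foo, bar, other = 11, 17, 99   # arbitrary token IDs for the check

# Identical primes within a column -> order-invariant bigrams:
P_inv = [[1, 2], [0, 2]]
assert gen_features([foo, bar], P_inv)[1] == gen_features([bar, foo], P_inv)[1]

# Identical primes across columns -> order-sensitive, distance-invariant:
P_skip = [[1, 2, 2], [0, 3, 0], [0, 0, 3]]
assert (gen_features([foo, bar, other], P_skip)[1] ==   # "foo bar <x>"
        gen_features([foo, other, bar], P_skip)[2])     # "foo <skip> bar"
```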

20. Step 4: Feature Weighting by Lookup
• Feature lookup tables are pre-built by the learning side of the filter, e.g.:
• Strict Probability:
LocalWeight = TimesSeenInClass / TimesSeenOverAllClasses
• Buffered Probability:
LocalWeight = TimesSeenInClass / (TimesSeenOverAllClasses + Constant)
• Document Count Certainty:
Weight = (TimesSeenInThisClass * DocumentsTrainedIntoThisClass) / ((TimesSeenOverAllClasses + Constant) * TotalDocumentsActuallyTrained)
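The three weight generators transcribe directly into code. This is a sketch: the function and parameter names are mine, and the default smoothing constant is an assumption:

```python
def strict_probability(seen_in_class, seen_over_all_classes):
    return seen_in_class / seen_over_all_classes

def buffered_probability(seen_in_class, seen_over_all_classes, constant=1.0):
    # The constant damps extreme weights for rarely-seen features.
    return seen_in_class / (seen_over_all_classes + constant)

def document_count_certainty(seen_in_class, seen_over_all_classes,
                             docs_in_class, total_docs, constant=1.0):
    # Buffered probability scaled by the class prior docs_in_class / total_docs.
    return ((seen_in_class * docs_in_class) /
            ((seen_over_all_classes + constant) * total_docs))
```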

21. In this model, Weight Generators Do Not Have To Be Statistical
• SpamAssassin: weights generated by a genetic optimization algorithm
• SVM: weights generated by linear algebra
• Winnow algorithm:
• uses additively-combined weights stored in the database
• each feature's weight starts at 1.0000
• weights are multiplied by a promotion/demotion factor if the result is below/above a predefined threshold during learning (see the sketch below)
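A sketch of one Winnow-style training step under those rules; the promotion and demotion factors here are illustrative placeholders, not values from the talk:

```python
def winnow_train_step(weights, features, threshold,
                      is_spam, promote=1.23, demote=0.83):
    """One mistake-driven Winnow update. The document's score is the sum
    of its feature weights (each defaulting to 1.0); on an error, every
    feature's weight is multiplied up or down."""
    score = sum(weights.get(f, 1.0) for f in features)
    if is_spam and score < threshold:            # missed a spam: promote
        factor = promote
    elif not is_spam and score > threshold:      # flagged a ham: demote
        factor = demote
    else:
        return score                             # correct: no update
    for f in features:
        weights[f] = weights.get(f, 1.0) * factor
    return score
```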

22. Unique Features --> Nonuniform Weights
• Because all features are unique, we can have nonuniform weighting schemes:
• nonuniform weights can be compiled into the weighting lookup tables, or
• a row vector of weights can be passed in from the feature generation column vectors.

23. Step 5: Weight Combination: the 1st Stateful Step
• Winnow combining
• Bayesian combining
• Chi-squared combining
• Sorted-weight approach: “only use the most extreme N weights found in the document”
• Uniqueness: use only the 1st occurrence of any feature
Output: a state vector, which may be of length 1 or longer, and may be nonuniform. (A Bayesian-combining sketch follows below.)
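For concreteness, a sketch of Bayesian combining in its classic two-class, feature-independence form. This is the generic textbook rule, not any specific filter's exact formula, and the log-space arithmetic is my addition to avoid underflow:

```python
import math

def bayes_combine(local_probs):
    """Combine per-feature local spam probabilities with Bayes' rule,
    assuming feature independence. Working in log space keeps long
    documents from underflowing floating point."""
    log_spam = sum(math.log(p) for p in local_probs)
    log_ham = sum(math.log(1.0 - p) for p in local_probs)
    m = max(log_spam, log_ham)
    spam, ham = math.exp(log_spam - m), math.exp(log_ham - m)
    return spam / (spam + ham)   # overall P(spam | features)

print(bayes_combine([0.99, 0.90, 0.20]))   # ~0.996 -> very spammy
```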

24. Final Threshold
• Comparison to a fixed threshold, or
• Comparison of one part of the output state to another part of the output state (useful when the underlying mathematical model includes a null hypothesis)

25. Emulation of Other Filtering Methods in the Unified Filtration Model
Emulating Whitelists and Blacklists:
• Voting-style whitelists
• Prioritized-rule blacklists
--> Compile entries into the lookup tables with superincreasing weights for each whitelisted or blacklisted term. (See the sketch below.)
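A sketch of the superincreasing-weight idea: each entry's weight exceeds the sum of all lower-priority weights, so a single higher-priority match always dominates. The doubling scheme and rule names are illustrative assumptions, not from the talk:

```python
# Hypothetical prioritized list entries, lowest priority first.
rules = ["generic_spam_phrase", "known_spammer_domain", "boss_whitelist"]

# Powers of two are superincreasing: each weight exceeds the sum of all
# weights before it, so one high-priority hit outvotes everything below.
weights = {name: 2 ** (i + 1) for i, name in enumerate(rules)}

assert all(weights[rules[i]] > sum(weights[r] for r in rules[:i])
           for i in range(len(rules)))
```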

26. Emulation of Other Filtering Methods in the Unified Filtration Model
Emulating Heuristic Filters:
• Use text-to-text translation to insert a text tag string for each heuristic satisfied.
• Compile entries in the lookup table corresponding to the positive or negative weight of each heuristic text tag string; ignore all other text.
• Sum the outputs and threshold.
This is the SpamAssassin model.

27. Emulation of Other Filtering Methods in the Unified Filtration Model
Emulating Bayesian Filters:
• Text-to-text translation as desired
• Use a unigram (SpamBayes) or digram (DSPAM) P matrix to generate the features
• Preload the lookup tables with local probabilities
• Use Bayes' Rule or Chi-squared to combine probabilities, and threshold at some value
This is just about every statistical filter...

28. Conclusions
• Everybody's filter is more or less the same!
• But filtration on only the text limits our information horizon.
• We need to broaden the information input horizon... but from where? And how?

29. Conclusions: Better Information Input – examples:
• CAMRAM – use outgoing email as an automatic whitelist
• Smart Squid – look at your web surfing to infer what might be legitimate email (say, receipts from online merchants)
• Honeypots – sources of new spam and of newly compromised zombie spambots (a.k.a. realtime blacklists, inoculation, peer-to-peer filter systems). Large ISPs have a big statistical advantage here over single users.

30. Thank You! Questions or comments?
