This course covers knowledge-based systems for spam filtering, including Bayesian approaches. Learn to classify emails using keywords and rules. Understand the terminology and implementation process. Explore various methods to handle spam automatically and evaluate their effectiveness.
CPE/CSC 481: Knowledge-Based Systems Dr. Franz J. Kurfess Computer Science Department Cal Poly
Course Overview • Introduction • Knowledge Representation • Semantic Nets, Frames, Logic • Reasoning and Inference • Predicate Logic, Inference Methods, Resolution • Reasoning with Uncertainty • Probability, Bayesian Decision Making • Expert System Design • ES Life Cycle • CLIPS Overview • Concepts, Notation, Usage • Pattern Matching • Variables, Functions, Expressions, Constraints • Expert System Implementation • Salience, Rete Algorithm • Expert System Examples • Conclusions and Outlook
Overview Spam Filtering • Motivation • Objectives • Chapter Introduction • Spam Terminology • Dealing with Spam • Laws and Regulations • Filtering via Keywords • Filtering via Rules • Learning • Spam and Bayes • Binary Classification of Documents • N-ary Classification • Implementation • SpamBayes Project • Related Projects • Important Concepts and Terms • Summary
Logistics • Introductions • Course Materials • textbooks (see below) • lecture notes • PowerPoint Slides will be available on my Web page • handouts • Web page • http://www.csc.calpoly.edu/~fkurfess • Term Project • Lab and Homework Assignments • Exams • Grading
Motivation • dealing with spam “manually” is very time-consuming, tedious, and prone to errors • various methods have been tried to “filter” spam, with varying success • early results with Bayesian approaches look very promising
Objectives • to be familiar with the terminology • spam • Bayesian approaches • to understand • elementary methods for handling spam automatically • more advanced methods • scenarios and applications for those methods • important characteristics • differences between methods, advantages, disadvantages, performance, typical scenarios • to evaluate the suitability of approaches for specific tasks • binary classification • n-ary classification • to be able to apply Bayesian filtering • spam • similar problems
Spam • broadly: any email that is not wanted by the recipient • similar to paper “junk” mail • easily recognized by recipients • unsolicited bulk email • not requested by the recipients • automatically sent out to a large number of recipients • “optional” characteristics • disguised or forged sender, return addresses and email forwarding information • questionable contents • illegal, unethical, fraudulent, ... • hidden activities • acknowledgement of receipt, spyware (“Web bugs”), virus
Terminology • spam terms • spam: negative (bad stuff) • ham: positive (good stuff) • filtering terms • false negative • spam incorrectly classified as ham; spam “gets through” • false positive • ham incorrectly classified as spam; valid messages are blocked • corpus • body of documents (email messages) • hapax, hapax legomenon • word that occurs only once in a specific message or corpus • sample or training set • messages used to train the system • test set • messages used to evaluate the system http://spambayes.sourceforge.net/
Filtering Spam • Keywords • Rules • Learning
Keywords • identify keywords that frequently occur in spam • simple and efficient • all incoming messages are checked for the occurrence of these keywords • if a message contains any or several of them, it is blocked • the list of keywords can be modified easily • not very accurate • many false positives • legitimate messages that happen to include “forbidden” words • many false negatives • can be easily circumvented • used in some early email filtering and Web blocking tools • little to moderate success
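The keyword check described above can be sketched in a few lines of Python. This is a minimal illustration, not a filter taken from any real tool: the keyword list and the names `SPAM_KEYWORDS`, `keyword_hits`, and `is_blocked` are all assumptions made for this example.

```python
import re

# Hypothetical keyword list; a real installation would maintain its own.
SPAM_KEYWORDS = {"viagra", "lottery", "winner"}

def keyword_hits(message, keywords=SPAM_KEYWORDS):
    """Return the keywords that occur as whole words in the message."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return words & set(keywords)

def is_blocked(message, min_hits=1):
    """Block the message if it contains at least min_hits distinct keywords."""
    return len(keyword_hits(message)) >= min_hits
```

Both weaknesses from the slide show up directly: a legitimate message such as "The lottery scene in the novel" is blocked (false positive), while a misspelling such as "v1agra" passes (false negative).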
Rules • characteristics of spam messages are described through if ... then rules • not too complicated, moderately efficient • characteristics can be combined • not only keywords • also formatting, headers • more accurate • fewer false positives • allows a better description of spam messages • fewer false negatives • somewhat more difficult to circumvent
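A rule-based filter along these lines might look as follows. It is only a sketch: the four rules, their point values, the threshold, and the dictionary layout of a message (`from`, `subject`, `body`) are assumptions chosen for illustration, not rules from an actual product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, str]], bool]  # the "if" part
    points: float                                # the "then" part: points added

# Hypothetical rules combining keywords, formatting, and headers.
RULES = [
    Rule("subject_all_caps", lambda m: m.get("subject", "").isupper(), 1.5),
    Rule("html_body", lambda m: "<html" in m.get("body", "").lower(), 1.0),
    Rule("missing_sender", lambda m: not m.get("from", "").strip(), 2.0),
    Rule("mentions_free", lambda m: "free" in m.get("body", "").lower(), 0.5),
]

def fired_rules(message: Dict[str, str], rules: List[Rule] = RULES) -> List[str]:
    """Names of all rules whose condition holds for the message."""
    return [rule.name for rule in rules if rule.condition(message)]

def rule_score(message: Dict[str, str], rules: List[Rule] = RULES) -> float:
    """Sum of the points of all rules that fire."""
    return sum(rule.points for rule in rules if rule.condition(message))

def is_spam(message: Dict[str, str], threshold: float = 2.0) -> bool:
    """Classify as spam when the accumulated points reach the threshold."""
    return rule_score(message) >= threshold
```

Because several weak indicators must add up before a message is blocked, a single "forbidden" word no longer triggers the filter on its own.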
Learning • samples of good (ham) and bad (spam) messages are given to the system before it is deployed • the system analyses various criteria, and tries to determine which criteria are most valuable for the distinction • used earlier for general email categorization • assignment of messages to folders • suggestion of actions to be performed (e.g. reply, delete, forward) • spam was not a problem at that time
Spam and Bayes • Binary Classification of Documents • two bins: • spam, ham • sometimes an implicit “undecided” bin is used • N-ary Classification • uses n bins • “sure spam”, “probably spam”, “maybe spam”, “unclear”, “maybe ham”, “probably ham”, “sure ham” • Related Approaches • neural networks instead of Bayesian filtering • essentially also use statistical techniques
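One simple way to obtain the seven bins is to cut a single spam score between 0 and 1 into intervals. The sketch below assumes such a score is already available; the boundary values in `UPPER_BOUNDS` are arbitrary assumptions and would need tuning on real mail.

```python
from bisect import bisect_right

# Bin names from the slide, ordered from lowest to highest spam score.
BINS = ["sure ham", "probably ham", "maybe ham", "unclear",
        "maybe spam", "probably spam", "sure spam"]

# Assumed upper bounds of the first six bins; the last bin extends to 1.0.
UPPER_BOUNDS = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]

def bin_for(score):
    """Map a spam score in [0, 1] to one of the seven bins."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    return BINS[bisect_right(UPPER_BOUNDS, score)]
```

With only the outermost two bounds, the same function reduces to the binary case with an "undecided" bin in the middle.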
Binary Classification of Documents • documents are parsed, and tokens extracted • pieces of the message that may serve as classification criteria • determined by the developer • the number of occurrences for each token is calculated • done for two corpora: one ham, one spam • results in two tables with occurrences of tokens in ham and spam • a third table is created that reflects the probability of a message being ham or spam
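The three tables can be sketched as follows. The formula is a simplified variant of the per-token calculation described in [Graham, 2003a]; it leaves out refinements such as weighting ham counts or ignoring rare tokens, so it should be read as an illustration under those assumptions. The function names and the clamping values `floor` and `ceiling` are choices made for this example.

```python
from collections import Counter

def count_tokens(corpus):
    """Count token occurrences over a corpus given as a list of token lists."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tokens)
    return counts

def spam_probabilities(ham_corpus, spam_corpus, floor=0.01, ceiling=0.99):
    """Build the third table: token -> estimated probability of spam."""
    ham_counts = count_tokens(ham_corpus)    # table 1: occurrences in ham
    spam_counts = count_tokens(spam_corpus)  # table 2: occurrences in spam
    n_ham = max(len(ham_corpus), 1)
    n_spam = max(len(spam_corpus), 1)
    table = {}
    for token in set(ham_counts) | set(spam_counts):
        ham_rate = min(1.0, ham_counts[token] / n_ham)
        spam_rate = min(1.0, spam_counts[token] / n_spam)
        p = spam_rate / (ham_rate + spam_rate)
        # Clamp so that no single token is ever treated as absolute proof.
        table[token] = max(floor, min(ceiling, p))
    return table
```

For example, a token that appears once in two ham messages and twice in two spam messages gets a probability of 1.0 / (0.5 + 1.0), roughly 0.67.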
Calculation of Probabilities • Tokenizer • Scoring • Training • Testing http://spambayes.sourceforge.net/
Tokenizer • breaks up a mail message into a series of tokens • usually words or word stems • sometimes complete phrases • may consider non-textual elements • message headers, HTML constructs, images, comments • it can be difficult to identify meaningful tokens • message body tokens • embedded URLs • message headers • correlation between different types of clues http://spambayes.sourceforge.net/
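A very small tokenizer in this spirit is sketched below. It uses Python's standard `email` module to separate headers from the body, and marks header and URL tokens with a prefix such as `Subject*`, similar to the token names in Graham's examples. The regular expressions, the choice of headers, and the decision to skip multipart bodies are simplifying assumptions; this is not the SpamBayes tokenizer.

```python
import re
from email import message_from_string

# Assumed token shape: letters, digits, and a few punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def tokenize(raw_message, headers=("Subject", "From", "To")):
    """Split a raw message into body tokens, header tokens, and URL tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    # Header tokens are prefixed so "free" in the subject differs from the body.
    for name in headers:
        value = msg.get(name, "")
        tokens += [f"{name}*{t}" for t in TOKEN_RE.findall(value)]
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are ignored in this sketch
    # Tokens inside embedded URLs get their own prefix.
    for url in URL_RE.findall(body):
        tokens += [f"Url*{t}" for t in TOKEN_RE.findall(url)]
    body = URL_RE.sub(" ", body)
    tokens += TOKEN_RE.findall(body)
    return tokens
```

Keeping case and trailing exclamation marks means that `free`, `Free`, and `free!` remain distinct tokens, which lets the scoring step learn different probabilities for each.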
Scoring • assigns a number to each message • 0: definite ham • 1: definite spam • most difficult and sensitive part of the system • incorrect scores • false positives • false negatives • unjustified confidence • scores are mostly close to 0 or 1, and rarely in between • improvements through using two separate probabilities • ham probability • spam probability • allows better treatment of unknown cases as “unsure” • substantial reduction of false positives and false negatives http://spambayes.sourceforge.net/
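A basic scoring step can be sketched as below. It combines per-token spam probabilities with the simple naive-Bayes rule, in the spirit of [Graham, 2003a], and then applies two cut-offs to obtain an "unsure" band. It does not implement the two-probability improvement mentioned on the slide. The default for unknown tokens, the number of tokens used, and both cut-offs are assumed values; all probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def combine(probabilities):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    # Sum logarithms instead of multiplying many small numbers.
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)).
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def score(tokens, table, unknown=0.4, max_tokens=15):
    """Score a message from the probability table built during training."""
    probs = [table.get(token, unknown) for token in set(tokens)]
    if not probs:
        return 0.5
    # Use only the tokens whose probability is farthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:max_tokens])

def classify(score_value, ham_cutoff=0.2, spam_cutoff=0.9):
    """Turn a score into 'ham', 'spam', or 'unsure'."""
    if score_value >= spam_cutoff:
        return "spam"
    if score_value <= ham_cutoff:
        return "ham"
    return "unsure"
```

The sketch also shows the "unjustified confidence" problem: two tokens with probability 0.99 already push the combined score above 0.9998, so scores cluster near 0 or 1.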
Training • presentation of examples for ham and spam • generates the probabilities used by the scoring system to assign values to new messages • corpus size • usually the larger, the better • too large may lead to overtraining • the number of ham and spam examples should be roughly equal • corpus quality • representative samples are very valuable • better quality can make up for lack of quantity • avoid misleading cues • e.g. recent spam vs. old ham; tags added by the mail system http://spambayes.sourceforge.net/
Testing • messages categorized as ham or spam are used for testing the performance of the system • frequently the existing collection of categorized messages is divided into a training and a testing set • intuitive insights often don’t work well • HTML tags • exclamation marks in the header • MESSAGES WRITTEN IN CAPITALS • cross-validation • formal technique that systematically divides the corpus into various combinations of training and test sets http://spambayes.sourceforge.net/
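The systematic division used in cross-validation can be sketched as a k-fold split. The function name `k_fold_splits` and the defaults for `k` and `seed` are assumptions for this example; the messages can be any already categorized items.

```python
import random

def k_fold_splits(messages, k=5, seed=0):
    """Yield (training_set, test_set) pairs; each message is tested exactly once."""
    if not 2 <= k <= len(messages):
        raise ValueError("k must be between 2 and the number of messages")
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    folds = [shuffled[i::k] for i in range(k)]
    for i, test_set in enumerate(folds):
        training_set = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield training_set, test_set
```

Training and scoring are then repeated once per pair, and the error counts are averaged over all k runs.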
Results • performance results are notoriously difficult to compare • message corpus • training methods • threshold • cut-off value for spam • “magic numbers” • parameters adjusted by the developer or user
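The influence of the threshold can be made concrete with a small evaluation helper. It is a sketch under assumptions: each test message is represented as a pair of a score and its true label, and the name `evaluate` and the default threshold are chosen for this example.

```python
def evaluate(scored_messages, threshold=0.9):
    """Count errors for a list of (score, is_spam) pairs at a given threshold."""
    scored = list(scored_messages)
    # False positive: ham whose score reaches the spam threshold.
    false_positives = sum(1 for s, is_spam in scored if not is_spam and s >= threshold)
    # False negative: spam whose score stays below the threshold.
    false_negatives = sum(1 for s, is_spam in scored if is_spam and s < threshold)
    spam_total = sum(1 for _, is_spam in scored if is_spam)
    filtering_rate = 1.0 - false_negatives / spam_total if spam_total else 0.0
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "filtering_rate": filtering_rate,
    }
```

Running the same scored messages through several thresholds shows the trade-off: lowering the cut-off catches more spam but blocks more ham, which is one reason published results are hard to compare.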
Selected Results • based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b] • 99.75% filtering rate on about 1750 spam messages over 1 month • 4 false negatives: spam got through • usage of mostly legitimate words • neutral text with an innocent-sounding URL • 3 false positives: ham got blocked • newsletters sent through commercial emailers • almost spam • email that happens to have features typically associated with spam • ALL CAPITALS, <FF0000>, in-line images, URLs
Token Probabilities • based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
Token          Probability
Subject*FREE   0.9999
free!!         0.9999
To*free        0.9998
Subject*free   0.9782
free!          0.9199
Free           0.9198
Url*free       0.9091
FREE           0.8747
From*free      0.7636
free           0.6546
N-ary Classification • more than two categories • similar techniques as in the binary approach • can be substantially more complex
Related Approaches • collaborative filtering • many people categorize messages as spam, and submit them to a central system • also should have ham samples • may “wash out” individual differences • neural networks • similar concepts, but different learning methods
Implementation • SpamBayes Project [SpamBayes] • stand-alone filter • plug-in for some popular mail programs • Related Projects • SpamAssassin http://spamassassin.org/ • combines statistical techniques, rules, black-lists, collaborative filtering • see Paul Graham’s list of spam filters at http://www.paulgraham.com/filters.html
Future Work • extension to more sophisticated tokens • phrases • letters replaced by visually similar symbols • e.g. o/0, l/1 • separators inserted between characters • spam -> s p a m, s-p-a-m • combination with other approaches • blacklists, whitelists, rule-based systems, ... • genetic algorithms • construction of filters through evolution
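A normalization pass for the two obfuscation tricks named above could look like this. It is a rough sketch: only the two look-alike pairs from the slide (0/o and 1/l) are mapped, the set of separator characters is an assumption, and the heuristic will also join genuine sequences of single characters such as "a b c".

```python
import re

# Look-alike symbols from the slide, mapped back to letters.
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})

# Assumed separator characters that may be inserted between letters.
SEPARATORS = re.compile(r"[ \-._*]")

# Three or more single characters, each separated by one separator.
SPACED_OUT = re.compile(r"\b(?:\w[ \-._*]){2,}\w\b")

def normalize_token(token):
    """Lower-case a token and undo look-alike substitutions in words."""
    token = token.lower()
    # Leave pure numbers such as "1800" untouched.
    if any(c.isalpha() for c in token):
        token = token.translate(LOOKALIKES)
    return token

def normalize(text):
    """Collapse spaced-out words, then normalize each token."""
    joined = SPACED_OUT.sub(lambda m: SEPARATORS.sub("", m.group(0)), text)
    return " ".join(normalize_token(t) for t in joined.split())
```

Applied before tokenizing, this turns "s-p-a-m" and "s p a m" into "spam" and "l0an" into "loan", so the variants share one entry in the probability table.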
References • [Graham, 2003a] Paul Graham, A Plan for Spam. http://www.paulgraham.com/spam.html, August 2002. • [Graham, 2003b] Paul Graham, Better Bayesian Filtering. http://www.paulgraham.com/better.html, January 2003. • [SpamBayes] SpamBayes: Bayesian anti-spam classifier written in Python. http://spambayes.sourceforge.net/, visited Feb. 2003. • [Robinson, 2002] Gary Robinson's Rants: Spam Detection. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, Dec. 2002. [A revised version is to appear in the March 2003 issue of the Linux Journal, http://www.linuxjournal.com/.] • [Giarratano & Riley 1998]
Important Concepts and Terms • agenda • backward chaining • common-sense knowledge • conflict resolution • expert system (ES) • expert system shell • explanation • forward chaining • inference • inference mechanism • If-Then rules • knowledge • knowledge acquisition • knowledge base • knowledge-based system • knowledge representation • Markov algorithm • matching • Post production system • problem domain • production rules • reasoning • RETE algorithm • rule • working memory