SPAM FILTERING By Ankur Khator 01005028 Gaurav Sharma 01005029 Arpit Mathur 01D05014
What is Spam Email? • “junk email” or “unsolicited commercial email”. • Spam filtering - a special case of email classification. • Only 2 classes – Spam and Non-spam.
Various Approaches • Bayesian Learning • Probabilistic model for spam filtering • Bag-of-words representation • Ripper algorithm • Context-sensitive learning • Boosting algorithm • Improving accuracy by combining weaker hypotheses
Naive Bayes for Spam • Seek a model for P(Y=1 | X1=x1, X2=x2, ..., Xd=xd) • From Bayes' theorem: P(Y=1 | X1=x1, ..., Xd=xd) = P(Y=1) · P(X1=x1, ..., Xd=xd | Y=1) / P(X1=x1, ..., Xd=xd) • Similarly: P(Y=0 | X1=x1, ..., Xd=xd) = P(Y=0) · P(X1=x1, ..., Xd=xd | Y=0) / P(X1=x1, ..., Xd=xd)
Justification for Using Bayes' Theorem • Sparseness of data • P(B | A) can be determined more easily and accurately than P(A | B)
Naive Bayes for Spam (contd.) • Assume conditional independence: P(X1=x1, ..., Xd=xd | Y=k) = ∏i P(Xi=xi | Y=k) • Also assume binary features: Xi = 1 if word i occurs at least once in the message, and Xi = 0 otherwise
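The binary bag-of-words representation above can be illustrated with a short sketch; the tokenizer (lowercasing and whitespace splitting) and the small vocabulary are illustrative assumptions, not details from the slides.

```python
def to_binary_features(text, vocabulary):
    """x_i = 1 if word i occurs at least once in the message, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocab = ["free", "click", "rabbit", "quick"]                   # hypothetical vocabulary
print(to_binary_features("Click here for FREE gifts", vocab))  # [1, 1, 0, 0]
```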
Naive Bayes for Spam (contd.) • The per-word terms log[P(Xi=xi | Y=1) / P(Xi=xi | Y=0)] are referred to as weights of evidence • The estimates are inconsistent when some probability is zero: a single unseen word drives the whole product to zero • Smooth the estimates by adding a small positive constant to both the numerator and the denominator of each probability estimate
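A minimal sketch of the smoothed estimates and the resulting weights of evidence, assuming messages are represented as sets of lowercase words and Laplace-style smoothing with a constant alpha (the slides only say that a small positive constant is added to the numerator and denominator):

```python
import math

def word_weights(spam_docs, ham_docs, vocabulary, alpha=1.0):
    """Smoothed P(X_w=1 | Y) estimates and per-word weights of evidence."""
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    weights = {}
    for w in vocabulary:
        spam_with_w = sum(1 for d in spam_docs if w in d)
        ham_with_w = sum(1 for d in ham_docs if w in d)
        p_spam = (spam_with_w + alpha) / (n_spam + 2 * alpha)  # P(X_w=1 | Y=spam), smoothed
        p_ham = (ham_with_w + alpha) / (n_ham + 2 * alpha)     # P(X_w=1 | Y=non-spam), smoothed
        weights[w] = math.log(p_spam / p_ham)                  # weight of evidence for word w
    return weights
```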
Classifying • Consider a new mail with the text "The quick rabbit rests" • Summing the weights of evidence of its words: 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0 = 2.63 • This log-odds score corresponds to a spam probability of 0.93
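A small sketch of the scoring step, assuming the score is the sum of the weights of evidence and is mapped to a probability with the logistic function (consistent with the 2.63 → 0.93 figures above):

```python
import math

def spam_probability(log_odds):
    """Convert a log-odds score into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

score = 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0      # "The quick rabbit rests"
print(round(spam_probability(score), 2))       # 0.93
```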
Threshold • A lower threshold gives a higher false positive rate (more legitimate mail flagged as spam) • A higher threshold gives a higher false negative rate (more spam slips through), which is preferred, since losing legitimate mail is costlier than receiving spam
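A tiny illustration of applying a decision threshold to the estimated spam probability; the 0.9 cutoff is an arbitrary illustrative value, not one taken from the slides:

```python
def label(p_spam, threshold=0.9):
    """Flag a message as spam only if the estimated probability exceeds the threshold."""
    return "spam" if p_spam >= threshold else "legitimate"

print(label(0.93))   # "spam", using the probability from the previous example
```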
Non-Linear Classification • A linear classifier ignores the effect of a word's context on its meaning, which is unrealistic • One could build a linear classifier that tests for more complex features such as simultaneous occurrences of words, but the computation cost is high • Non-linear classification is the solution
Ripper • A hypothesis is a disjunction of different contexts • Each context is a conjunction of simple terms • Example: a context for w1 is "w2 ∈ data and w3 ∈ data", i.e. for the context to be true, w1 must occur together with w2 and w3 • Three components of the Ripper algorithm:
Rule Learning • spam ← "spam" ∈ Subject • spam ← "Free" ∈ Subject and "spam" ∈ Subject • spam ← "Gift!!" ∈ Subject and "Click" ∈ Subject • The learned rule is the disjunction of the three statements above • There is also an initial set of rules
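A hedged sketch of how such rules can be represented and applied: each rule is a set of words that must all appear in the subject line, and the classifier is the disjunction of the rules. The Python encoding below is an illustrative assumption, not RIPPER's internal representation.

```python
RULES = [
    {"spam"},             # spam <- "spam" in Subject
    {"free", "spam"},     # spam <- "free" in Subject AND "spam" in Subject
    {"gift!!", "click"},  # spam <- "gift!!" in Subject AND "click" in Subject
]

def is_spam(subject, rules=RULES):
    words = set(subject.lower().split())
    # Spam if any rule (a conjunction of word tests) fires: a disjunction of contexts.
    return any(rule <= words for rule in rules)

print(is_spam("Click now to claim your Gift!!"))   # True (third rule fires)
```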
Constructing the Rule Set • The initial rule set is constructed using a greedy strategy • It is based on IREP (Incremental Reduced Error Pruning) • To construct a new rule, the dataset is partitioned into two parts: a training (growing) set and a pruning set • A single condition is added to the rule at a time
Simplification and Optimization • At every step, the condition that most increases the density of positive examples covered is added • Adding stops when the clause covers no negative example or there is no positive gain • After this, pruning (i.e. simplification) is done, again following a greedy strategy at every stage
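A rough sketch of growing one rule greedily in this style, assuming messages are sets of words and a rule is a set of required words; the gain criterion below (increase in the density of covered positive examples) follows the slide's description, not necessarily RIPPER's exact heuristic.

```python
def grow_rule(pos, neg, vocabulary):
    """Greedily add word conditions until no negatives are covered or no gain remains."""
    rule = set()
    covered_pos, covered_neg = list(pos), list(neg)
    while covered_neg:                        # stop when no negative examples are covered
        current = len(covered_pos) / (len(covered_pos) + len(covered_neg))
        best_word, best_density = None, current
        for w in vocabulary:
            if w in rule:
                continue
            p = sum(1 for d in covered_pos if w in d)
            n = sum(1 for d in covered_neg if w in d)
            if p == 0:
                continue
            density = p / (p + n)             # density of positives after adding w
            if density > best_density:
                best_word, best_density = w, density
        if best_word is None:                 # no condition gives positive gain
            break
        rule.add(best_word)
        covered_pos = [d for d in covered_pos if best_word in d]
        covered_neg = [d for d in covered_neg if best_word in d]
    return rule
```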
Reaching Sufficient Rules • During pruning, the condition whose deletion maximizes a scoring function of U+ and U-, the numbers of positive and negative pruning examples covered, is removed • Rule addition terminates when every positive example is covered by some rule, i.e. there is no further information gain • But if the data is noisy, the number of rules increases rapidly
MDL • Several heuristics are applied to address this problem; MDL (Minimum Description Length) is one of them • After each rule is added, the total description length of the current rule set and the examples is computed • Addition of rules is stopped when this length is more than d bits larger than the shortest description length found so far
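A hedged sketch of the MDL-based stopping heuristic: track the smallest total description length seen so far and stop adding rules once the current length exceeds it by more than d bits. The description_length function is a placeholder, and d = 64 is an illustrative choice, not a value given in the slides.

```python
def build_rule_set(candidate_rules, description_length, d=64):
    """Add rules greedily, stopping when description length grows d bits past the best."""
    rule_set, best_len = [], float("inf")
    for rule in candidate_rules:
        rule_set.append(rule)
        total = description_length(rule_set)   # bits for rules + exceptions (placeholder)
        best_len = min(best_len, total)
        if total > best_len + d:
            rule_set.pop()                     # undo the rule that exceeded the budget
            break
    return rule_set
```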
AdaBoost • It is easy to find rules of thumb that are often correct • e.g. if "buy now" occurs in a message, predict spam • It is hard to find a single rule that is very accurate • AdaBoost helps here: • a general method for converting rough rules of thumb into a highly accurate prediction rule • works by concentrating on the hard examples
Algorithm • Input: S = {(xi, yi)}, i = 1, ..., m • Initialize D1(i) = 1/m for all i • For t = 1 to T: • ht = WeakLearner(S, Dt) • Choose βt = (1/2) ln((1 − εt) / εt), where εt is the weighted error of ht (proven to minimize the error in the two-class case) [2] • Update Dt+1(i) = Dt(i) exp(−βt yi ht(xi)) and normalize • Final hypothesis: f(x) = Σt βt ht(x)
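A compact sketch of this loop with word-presence rules of thumb as the weak learners (if a word occurs, predict spam), assuming labels are +1 for spam and -1 for legitimate mail; the data format, vocabulary handling, and stump learner are illustrative assumptions, and only the boosting updates follow the algorithm above.

```python
import math

def weak_learner(docs, labels, D, vocabulary):
    """Pick the (word, sign) stump with the lowest weighted error.
    The stump predicts `sign` if the word is present, -sign otherwise."""
    best = None
    for w in vocabulary:
        for sign in (+1, -1):
            err = sum(d_i for d_i, doc, y in zip(D, docs, labels)
                      if (sign if w in doc else -sign) != y)
            if best is None or err < best[2]:
                best = (w, sign, err)
    return best

def adaboost(docs, labels, vocabulary, T=10):
    m = len(docs)
    D = [1.0 / m] * m                            # start with uniform example weights
    hypotheses = []
    for _ in range(T):
        w, sign, err = weak_learner(docs, labels, D, vocabulary)
        err = max(err, 1e-10)                    # guard against a perfect stump
        if err >= 0.5:
            break                                # weak learner no better than chance
        beta = 0.5 * math.log((1 - err) / err)
        hypotheses.append((w, sign, beta))
        # Re-weight: misclassified examples get heavier, correct ones lighter, then normalize.
        D = [d_i * math.exp(-beta * y * (sign if w in doc else -sign))
             for d_i, doc, y in zip(D, docs, labels)]
        total = sum(D)
        D = [d_i / total for d_i in D]
    return hypotheses

def classify(doc, hypotheses, theta=0.0):
    """Sum the weighted stump votes; flag as spam only if f(x) exceeds the threshold."""
    f = sum(beta * (sign if w in doc else -sign) for w, sign, beta in hypotheses)
    return 1 if f > theta else -1
```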
Accuracy • Weighted accuracy measure: (λL- + S+) / (λL + S) • λ: strictness measure • L: number of legitimate messages, S: number of spam messages • L-: number of legitimate messages classified as legitimate • S+: number of spam messages classified as spam • Improving accuracy: • increase λ • introduce a threshold θ: an example is classified positive only if f(x) > θ (the default θ is zero) • Recall: correctly predicted spam out of the total number of spam messages in the corpus • Precision: correctly predicted spam out of the number of messages predicted as spam
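A small sketch of these measures given counts from a test corpus; the variable names mirror the slide's notation, and λ defaults to 1 here only for illustration.

```python
def weighted_accuracy(L, S, L_minus, S_plus, lam=1.0):
    """(lambda*L- + S+) / (lambda*L + S): errors on legitimate mail cost lambda times more."""
    return (lam * L_minus + S_plus) / (lam * L + S)

def recall(S_plus, S):
    return S_plus / S                 # correctly predicted spam / all spam in the corpus

def precision(S_plus, predicted_spam):
    return S_plus / predicted_spam    # correctly predicted spam / messages flagged as spam
```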
Pros and Cons • Pros: • Fast and simple • No parameters to tune • Flexible: can be combined with any learning algorithm • No knowledge of the WeakLearner needed • Error reduces exponentially • Robust to overfitting • Cons: • Data driven: requires lots of data • Performance depends on the WeakLearner • May fail if the WeakLearner is too weak
Conclusion • As a text categorization algorithm, RIPPER works better than Naive Bayes (the advantage is larger with more classes) • For spam filtering (2 classes) the two are comparable • Boosting performs better than any weak learner it builds on
References • [1] Xavier Carreras and Lluís Màrquez. Boosting Trees for Anti-Spam Email Filtering. 2001. • [2] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. • [3] David Madigan. Statistics and the War on Spam. 2004. • [4] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000. http://citeseer.ist.psu.edu/androutsopoulos00evaluation.html • [5] William W. Cohen and Yoram Singer. Context-Sensitive Learning Methods for Text Categorization. SIGIR 1996: 307-315.