Topic Models Based Personalized Spam Filter
Sudarsun. S, Director – R&D, Checktronix India Pvt Ltd, Chennai
Venkatesh Prabhu. G, Research Associate, Checktronix India Pvt Ltd, Chennai
Valarmathi B, Professor, SKP Engineering College, Thiruvannamalai
ISCF - 2006
What is Spam?
• Unsolicited, unwanted email
What is Spam Filtering?
• Detection/filtering of unsolicited content
What is Personalized Spam Filtering?
• The definition of “unsolicited” becomes personal
Approaches
• Origin-based filtering [generic]
• Content-based filtering [personalized]
Content-Based Filtering
• What does the message contain? Images, text, URLs
• Is it “irrelevant” to my preferences?
• How do we define relevancy?
• How does the system understand relevancy?
  • Supervised learning: teach the system what I like and what I don’t
  • Unsupervised learning: decisions made using latent patterns
Content-Based Filtering – Methods
• Bayesian spam filtering (see the sketch after this list)
  • Simplest design / low computation cost
  • Based on keyword distribution
  • Cannot work on contexts
  • Accuracy is around 60%
• Topic-models-based text mining
  • Based on the distribution of n-grams (key phrases)
  • Addresses synonymy and polysemy
  • Low run-time computation cost
  • Unsupervised technique
• Rule-based filtering
  • Supervised technique based on hand-written rules
  • Best accuracy for known cases
  • Cannot adapt to new patterns
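To make the keyword-distribution idea concrete, here is a minimal Naive-Bayes-style spam scorer in Python. It is a sketch only, not the filter evaluated in these slides; the toy training mails and the Laplace smoothing choice are assumptions for illustration.

```python
import math
from collections import Counter

def train_nb(spam_docs, ham_docs):
    """Collect per-class keyword counts from training mails."""
    spam_counts = Counter(w for d in spam_docs for w in d.split())
    ham_counts = Counter(w for d in ham_docs for w in d.split())
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab

def spam_log_odds(msg, spam_counts, ham_counts, vocab):
    """Sum Laplace-smoothed log-likelihood ratios over the message's tokens."""
    n_spam, n_ham, v = sum(spam_counts.values()), sum(ham_counts.values()), len(vocab)
    score = 0.0
    for w in msg.split():
        p_w_spam = (spam_counts[w] + 1) / (n_spam + v)
        p_w_ham = (ham_counts[w] + 1) / (n_ham + v)
        score += math.log(p_w_spam / p_w_ham)
    return score  # > 0 suggests spam, < 0 suggests ham

spam_counts, ham_counts, vocab = train_nb(
    ["buy cheap meds now", "win money now"],
    ["meeting agenda attached", "lunch tomorrow"])
print(spam_log_odds("win cheap money", spam_counts, ham_counts, vocab))
```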
Topic Models
• Treat every word as a feature
• Represent the corpus as a higher-dimensional distribution
• SVD: decomposes the higher-dimensional data into a small reduced subspace containing only the dominant feature vectors
• PLSA: documents can be understood as a mixture of topics
Rule-Based Approaches
• N-grams – a language-model approach
• The more n-grams two texts share, the closer their patterns are
LSA Model, in Brief
• Describes the underlying structure among texts
• Computes similarities between texts
• Represents documents in a high-dimensional semantic space (the term-document matrix)
• The high-dimensional space is approximated by a low-dimensional space using Singular Value Decomposition (SVD)
• SVD decomposes the higher-dimensional TDM into U, S, V matrices:
  U: left singular vectors (reduced word vectors)
  V: right singular vectors (reduced document vectors)
  S: array of singular values (variances or scaling factors)
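A minimal numpy sketch of this decomposition, assuming a toy term-document matrix and rank k = 2 (both illustrative, not from the slides):

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents).
tdm = np.array([[2., 0., 1.],
                [0., 3., 1.],
                [1., 1., 0.],
                [0., 2., 2.]])

# SVD: tdm = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(tdm, full_matrices=False)

k = 2  # keep only the k dominant singular vectors
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Reduced document vectors live in the k-dimensional semantic space.
doc_vectors = (np.diag(S_k) @ Vt_k).T   # one row per document

# Cosine similarity between documents 0 and 1 in the reduced space.
a, b = doc_vectors[0], doc_vectors[1]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```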
PLSA Model
• In the PLSA model, a document is a mixture of topics, and topics generate words.
• The probabilistic latent factor model can be described by the following generative model:
  • Select a document d_i from D with probability Pr(d_i).
  • Pick a latent factor z_k with probability Pr(z_k | d_i).
  • Generate a word w_j from W with probability Pr(w_j | z_k).
  where Pr(d_i, w_j) = Pr(d_i) · Σ_k Pr(z_k | d_i) · Pr(w_j | z_k)
• The aspect model’s parameters are computed using the EM algorithm.
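The EM re-estimation can be sketched in a few lines of numpy. This is a generic PLSA implementation under the Pr(z|d)·Pr(w|z) parameterization above; the toy count matrix, aspect count, and iteration budget are assumptions for illustration:

```python
import numpy as np

def plsa(counts, k, iters=50, seed=0):
    """EM for PLSA on a document-word count matrix (D x W)."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, k)); p_z_d /= p_z_d.sum(1, keepdims=True)  # Pr(z|d)
    p_w_z = rng.random((k, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # Pr(w|z)
    for _ in range(iters):
        # E-step: posterior Pr(z | d, w), shape (D, W, k)
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(2, keepdims=True) + 1e-12
        # M-step: re-estimate Pr(w|z) and Pr(z|d) from expected counts
        nz = counts[:, :, None] * post            # expected count of (d, w, z)
        p_w_z = nz.sum(0).T                       # (k, W)
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nz.sum(1)                         # (D, k)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

counts = np.array([[4, 1, 0, 0], [0, 0, 3, 2], [2, 2, 1, 0]], float)
p_z_d, p_w_z = plsa(counts, k=2)
print(p_z_d)  # each mail as a mixture over latent aspects
```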
N–Gram Approach
• A language-model approach
• Looks for repeated patterns
• Each word depends probabilistically on the n−1 preceding words
• Calculates and compares N-gram profiles
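A small Python sketch of building and comparing profiles. The out-of-place rank distance used here is a common choice from the n-gram categorization literature and is an assumption; the slides only say that profiles are compared:

```python
from collections import Counter

def ngram_profile(text, n=3, top=100):
    """Most frequent word n-grams of a text (its 'profile')."""
    words = text.lower().split()
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams).most_common(top)

def profile_distance(p1, p2):
    """Out-of-place distance: compare rank positions of shared n-grams."""
    r1 = {g: i for i, (g, _) in enumerate(p1)}
    r2 = {g: i for i, (g, _) in enumerate(p2)}
    penalty = max(len(r1), len(r2))  # cost for n-grams missing in the other
    return sum(abs(r1[g] - r2.get(g, penalty)) for g in r1)

p_spam = ngram_profile("win free money win free money now", n=2)
p_mail = ngram_profile("you can win free money today", n=2)
print(profile_distance(p_mail, p_spam))  # smaller -> closer to the spam profile
```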
Overall System Architecture
[Diagram: training mails and the test mail pass through a Preprocessor; the preprocessed mail goes in parallel to the LSA model, the PLSA model, the N-Gram classifier, and other classifiers; their outputs feed a Combiner, which emits the final result.]
Preprocessing
• Feature extraction: tokenizing
• Feature selection: pruning, stemming, weighting
• Feature representation: term-document matrix generation
Sub-Spacing
• LSA / PLSA model projection
• Feature reduction: Principal Component Analysis
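A compact sketch of the preprocessing steps in Python. The stop-word list, the crude suffix stripper standing in for a real stemmer, and the smoothed tf-idf weighting are all assumptions for illustration:

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "to", "of"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def build_tdm(docs, min_count=1):
    """Tokenize, prune, stem, and weight into a term-document matrix."""
    bags = [Counter(stem(w) for w in tokenize(d) if w not in STOPWORDS)
            for d in docs]
    df = Counter(t for bag in bags for t in bag)      # document frequency
    vocab = sorted(t for t, c in df.items() if c >= min_count)
    n = len(docs)
    # Rows: terms, columns: documents; smoothed tf-idf weighting.
    tdm = [[bag[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for bag in bags]
           for t in vocab]
    return vocab, tdm

vocab, tdm = build_tdm(["Win free money now!", "Meeting notes attached."])
print(vocab)
```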
Principal Component Analysis (PCA)
• Data reduction: ignore the features of lesser significance
• Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that best represent the data
• The original data set is reduced to N data vectors on c principal components (reduced dimensions)
• Detects structure in the relationships between variables, which is then used to classify data
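A minimal PCA sketch via eigen-decomposition of the covariance matrix, assuming row-wise data vectors; the random input is only a placeholder:

```python
import numpy as np

def pca(X, c):
    """Project the N k-dimensional rows of X onto the top c principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # k x k covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigh: symmetric matrices
    order = np.argsort(vals)[::-1][:c]      # indices of the top-c eigenvalues
    W = vecs[:, order]                      # k x c projection matrix
    return Xc @ W                           # N x c reduced data

X = np.random.default_rng(0).random((10, 5))  # 10 vectors in 5 dimensions
print(pca(X, 2).shape)                         # -> (10, 2)
```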
LSA Classification Pipeline
[Diagram: input mails → token list → LSA model (M×R matrix; M: vocab size, R: rank) → 1×R vector → PCA (R×R′ matrix; R: input-variable size, R′: output-variable size) → 1×R′ vector → BPN → classification score]
PLSA Classification Pipeline
[Diagram: input mails → token list → PLSA model (M×Z matrix; M: vocab size, Z: aspect count) → 1×Z vector → PCA (Z×Z′ matrix; Z: input-variable size, Z′: output-variable size) → 1×Z′ vector → BPN → classification score]
(P)LSA Classification
Model Training
• Build the global (P)LSA model from the training mails.
• Vectorize the training mails using the LSA/PLSA model.
• Reduce the dimensionality of the matrix of pseudo-vectors of the training documents using PCA.
• Feed the reduced matrix into the neural network for learning.
Model Testing
• Each test mail is fed to the (P)LSA model for vectorization.
• The vector is reduced using the PCA model.
• The reduced vector is fed into the BPN neural network.
• The BPN network emits its prediction with a confidence score.
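The train/test flow can be approximated with scikit-learn, with TruncatedSVD standing in for the LSA model and MLPClassifier standing in for the BPN network; the toy mails, component counts, and hyperparameters are assumptions:

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

train = ["win free money now", "cheap meds online",
         "meeting agenda attached", "see you at lunch"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Training: vectorize, build the LSA space, reduce with PCA, fit the network.
vec = CountVectorizer()
X = vec.fit_transform(train)
lsa = TruncatedSVD(n_components=3, random_state=0)   # LSA via truncated SVD
X_lsa = lsa.fit_transform(X)
pca = PCA(n_components=2)
X_red = pca.fit_transform(X_lsa)
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_red, labels)

# Testing: project a new mail through the same models, then predict.
test = pca.transform(lsa.transform(vec.transform(["free money at lunch"])))
print(net.predict(test), net.predict_proba(test))  # label + confidence
```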
N-Gram Method
• Construct an N-gram tree from the training docs
  • Documents make the leaves
  • Nodes are the N-grams identified in the docs
• Weight of an N-gram = number of children
• A higher-order N-gram implies more weight
• Spam-adjusted weight: Wt ← Wt × S / (S + L), where
  • P: total number of docs sharing an N-gram
  • S: number of SPAM docs sharing the N-gram
  • L: P − S
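A sketch of the spam-share scaling in Python. The slide gives only the Wt × S / (S + L) adjustment; the raw weight here (children count boosted by the n-gram order) is an assumed placeholder:

```python
def ngram_weight(n, spam_docs, ham_docs):
    """Weight of an n-gram node from the documents (leaves) that share it.

    P = total docs sharing the n-gram, S = spam docs, L = P - S; the raw
    weight is scaled by S / (S + L), the spam share of the sharing docs.
    """
    S, L = len(spam_docs), len(ham_docs)
    P = S + L
    if P == 0:
        return 0.0
    wt = P * n   # more children and a higher n-gram order -> more weight (assumed)
    return wt * S / (S + L)

# A 3-gram shared by spam leaves T1, T2, T5 and ham leaf T3:
print(ngram_weight(3, spam_docs=["t1", "t2", "t5"], ham_docs=["t3"]))
```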
[Figure: an example N-gram tree, with N-gram nodes N1–N4 (ranked 1st, 2nd, 2nd, 3rd) as internal nodes and documents T1–T5 as leaves]
Combiner
• Mixture of experts
• Gets predictions from all the experts
• Uses the most common prediction
• Uses the prediction with the maximum confidence score
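One way to read these two rules is majority vote with confidence as the tie-breaker; that interpretation is an assumption of this sketch:

```python
from collections import Counter

def combine(predictions):
    """Mixture of experts over (label, confidence) pairs from each classifier."""
    votes = Counter(label for label, _ in predictions)
    top = votes.most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:
        winner = top[0][0]                                # clear majority
    else:
        winner = max(predictions, key=lambda p: p[1])[0]  # fall back to confidence
    conf = max(c for label, c in predictions if label == winner)
    return winner, conf

# LSA, PLSA and N-Gram experts each vote with a confidence score:
print(combine([("spam", 0.9), ("ham", 0.6), ("spam", 0.7)]))  # ('spam', 0.9)
```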
Conclusion
• The objective is to filter mail messages based on the preferences of an individual
• Classification performance increases with increased (incremental) training
• Initial learning is not necessary for LSA, PLSA, and N-Gram; they perform unsupervised filtering
• Prediction is fast, although background training is a relatively slow process
Any Queries? You can post your queries to sudar@burning-glass.com