Pattern Recognition using Support Vector Machine and Principal Component Analysis Ahmed Abbasi MIS 510 3/21/2007
Outline • Background • Support Vector Machine • Classification • Linear Kernel • Applications: Text Categorization • Non Linear Kernels • Applications: Document Categorization • Ensemble Methods • Applications: Image Recognition • Regression and Feature Selection • Principal Component Analysis • Standard PCA • Applications: Style Categorization • Kernel PCA • Applications: Image Categorization • PCA Ensembles • Applications: Style Categorization • SVM and PCA Resources
Background • Statistical Pattern Recognition • Includes classic problems such as character recognition and medical diagnosis. • Machine learning algorithms have become popular for pattern recognition. • Due to enhanced computational power over the past 30-40 years. • Machines effective for structured and (in some cases) semi-structured problems. • Popular recent data mining applications include credit scoring, text categorization, image recognition.
Background • Data Mining Terminology • It’s important to first review some common data mining terms. • Data mining data is typically represented using a feature matrix. • Features • Attributes used for analysis • Represented by columns in feature matrix • Instances • Entity with certain attribute values • Represented by rows in feature matrix • An example instance (also called a feature vector) is highlighted in red. • Class Labels • Indicate category for each instance. • This example has two classes (C1 and C2). • Only used for supervised learning. [Figure: The Feature Matrix. Columns are features (attributes used to classify instances); rows are instances; each instance has a class label.]
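The terminology above can be sketched in a few lines of plain Python. This is only an illustration: the feature names, values, and the `column` helper are hypothetical and do not come from the slide's figure.

```python
# Hypothetical toy data: names and values are illustrative only.
feature_names = ["f1", "f2", "f3"]

# Each row is one instance (a feature vector); each column is one feature.
feature_matrix = [
    [0.2, 1.0, 3.5],
    [0.9, 0.0, 1.2],
    [0.4, 1.0, 2.8],
]

# One class label per instance; labels are only used for supervised learning.
class_labels = ["C1", "C2", "C1"]


def column(matrix, j):
    """Return the values of feature j across all instances."""
    return [row[j] for row in matrix]


n_instances = len(feature_matrix)
n_features = len(feature_matrix[0])
```

Here `feature_matrix[0]` is the first instance and `column(feature_matrix, 1)` is the second feature across all instances.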
Background • Loan Application Data Example • Machine learning algorithms are often used by financial institutions for making loan decisions. • Loan data is represented using a feature matrix. • Features • Credit score, loan amount, loan type, applicant’s income, etc. • Instances • Each instance represents a prior loan. • Class Labels • Two classes: whether the borrower honored the loan or defaulted. [Figure: The Loan Data Feature Matrix. Columns are features (attributes used to classify the loan decision); rows are prior loan instances used to classify future loans.]
Background • Two broad categories of machine learning algorithms. • Supervised learning algorithms • Also called discriminant methods • Require training data with class labels • Some examples already discussed in previous lectures include Neural Networks and ID3/C4.5 Decision Tree algorithms. • Unsupervised learning algorithms • Non-discriminant methods • Build models based on training data, without use of class labels
Background • In this lecture, we will discuss two popular machine learning algorithms. • Support Vector Machine • Supervised learning method • Principal Component Analysis • Unsupervised learning method
Support Vector Machine: Background • Grounded in Statistical Learning Theory, or VC (Vapnik-Chervonenkis) Theory. • Technique introduced in the mid-1990s. • Developed at AT&T Bell Labs. • Some interesting extensions done at Microsoft Research. • The idea is to select, from a set of functions, the one (called the hyperplane) that minimizes the sum of the empirical risk and the VC confidence.
Support Vector Machine: Background • The intuition behind SVM: VC Theory • Two quantities matter: • The VC confidence for a set of functions: proportional to the “capacity” of the function set. • The empirical risk on the training data: an indicator of the function set’s effectiveness. • The best training model is one that minimizes these two: the lowest empirical risk and the lowest VC dimension should hopefully result in the most accurate and generalizable model.
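The formula on this slide did not survive extraction. For reference, the risk bound from VC theory is commonly stated as below; the symbols (h for the VC dimension, l for the number of training instances, η for the confidence parameter) are the usual textbook notation and are an assumption here, not necessarily the slide's own.

```latex
% Holds with probability 1 - \eta over the draw of the training set.
R(\alpha) \;\le\;
\underbrace{R_{\mathrm{emp}}(\alpha)}_{\text{empirical risk}}
\;+\;
\underbrace{\sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}}_{\text{VC confidence}}
```

The second term grows with h, which is why a function set with lower capacity gives a tighter bound for the same empirical risk.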
Support Vector Machine: Background • Linear Kernel • Uses a linear hyperplane to separate the different class instances. • The circled instances represent the support vectors. • These are the instances that set the boundaries on the hyperplane. • The distance between the hyperplane and support vectors represents the margin. • The hyperplane which maximizes this margin is used. • The greater the margin, the greater the likelihood that the SVM model will be generalizable.
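The margin idea can be sketched as follows. This snippet does not train an SVM; it only evaluates a given hyperplane w·x + b = 0 on hypothetical 2-D points, reporting the smallest distance (the margin) and which instances attain it (the support vectors for that hyperplane). The function names and data are illustrative assumptions.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support(w, b, points, labels):
    """Return the smallest label-signed distance and the indices attaining it."""
    dists = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    margin = min(dists)
    support = [i for i, d in enumerate(dists)
               if math.isclose(d, margin, abs_tol=1e-9)]
    return margin, support


# Hypothetical data with labels +1 / -1; the hyperplane is the line x1 = 0.
points = [(1.0, 2.0), (3.0, 0.0), (-1.0, 1.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
margin, support = margin_and_support((1.0, 0.0), 0.0, points, labels)
```

An SVM trainer would search over w and b for the hyperplane that makes this margin as large as possible.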
Support Vector Machine: Classification • Linear Kernels for Text Categorization • Linear SVM has been used for a plethora of important text categorization problems: • Topic Categorization • Classifying a set of documents by topic • Sentiment Classification • Classifying online movie and/or product reviews as “positive” or “negative” • Style Classification • Categorizing text based on authorship (writing style)
Support Vector Machine: Classification • Topic Categorization • Motivation: Digital Libraries!!! • Arranging documents by topic is a natural way to organize information in online libraries. • Dumais et al. (1998) at Microsoft Research conducted an in-depth topic categorization study comparing linear SVM with other techniques on the Reuters corpus. • Found that SVM outperformed other techniques on most topics as well as overall.
Support Vector Machine: Classification • Sentiment Categorization • Motivation: Market Research!!! • Gathering consumer preference data is expensive • Yet it’s also essential when introducing new products or improving existing ones. • Software for mining online review forums….$10,000 • Information gathered…….priceless. (www.epinions.com)
Support Vector Machine: Classification • Sentiment Classification Experiment • Objective: to test the effectiveness of features and techniques for capturing opinions. • Test bed of 2000 digital camera product reviews taken from www.epinions.com. • 1000 positive (4-5 star) and 1000 negative (1-2 star) reviews • 500 for each star level (i.e., 1, 2, 4, 5) • Two experimental settings were tested • Classifying 1 star versus 5 star (extreme polarity) • Classifying 1+2 star versus 4+5 star (milder polarity) • Feature set encompassed a lexicon of 3000 positively or negatively oriented adjectives and word n-grams. • Compared C4.5 decision tree against SVM. • Both run using 10-fold cross validation.
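For readers unfamiliar with 10-fold cross validation, a minimal sketch of the splitting step is shown here. It is an assumption-laden illustration: the function name and seed are hypothetical, the split is a plain random one (not stratified by star level), and the study's actual procedure is not documented on the slide.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 2000 reviews: each fold trains on 1800 and tests on the remaining 200.
splits = list(k_fold_indices(2000, k=10))
```

Every review is used for testing exactly once, and the reported accuracy is typically the average over the ten folds.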
Support Vector Machine: Classification • Sentiment Classification Experimental Results • SVM significantly outperformed C4.5 on both experimental settings. • The improved performance of SVM was attributable to its ability to better detect reviews containing sentiments with less polarity. • Many of the milder (2 and 4 star) reviews contained positive and negative comments about different aspects of the product. • It was more difficult for the C4.5 technique to detect the overall orientation of many of these reviews.
Support Vector Machine: Classification • Style Categorization • Motivation: Online Anonymity Abuse!!! • Ability to identify people based on writing style can allow the use of stylometric authentication. • Important for many online text-based applications: • Email scams (email body text) • Online auction fraud (feedback comments) • Cybercrime (forum, instant messaging logs) • Computer hacking (program code)
Support Vector Machine: Classification • Style Categorization Experimental Results: Stylometric Identification using SVM • Linear SVM kernel was fairly effective for identifying up to 50 authors. • However, performance fell as the number of authors increased (e.g., 100 authors). • Thus, the use of a single SVM may not be appropriate as the number of author classes increases. • Another problem is that the use of supervised techniques may not be suitable for online settings. [Figure: Classification Accuracy (%); values not preserved.]
Support Vector Machine: Classification • More Complex Problems: Fraudulent Escrow Website Categorization • Motivation: Online Escrow Fraud nets billions of dollars in revenue annually!!! • Given the growing number of fraudulent sellers/traders online, people are told to use escrow services for security. • So naturally, fake escrow websites have started to pop up. • Online fraud databases such as Artists-Against-419 document an average of 30-40 new sites every day!!! • Especially prevalent for online sales of larger goods, such as vehicles.
Support Vector Machines: Classification • Fraudulent Escrow Website Categorization • Which of the following escrow websites are fake? ***All Of Them***
Support Vector Machine: Classification [Figure annotations: Same Text and Icon; Same Image and Banner; Same Page Design (HTML and URLs).]
Support Vector Machine: Classification • More Complex Problems: Fraudulent Escrow Website Categorization • Websites contain many pages. • Each page contains HTML, body text, images, URL and anchor text, and in/out links. • Each of these forms of content is important for detecting fake escrow websites. • Not necessarily more complex in terms of classification difficulty, but more representational complexity.
Support Vector Machines: Classification • Fraudulent Escrow Website Categorization • Using individual feature categories with a single linear SVM is no problem in this case. • However, if we wish to use all features, the one-to-many relationship between pages and images is problematic. • Also, what about site structure features? • E.g., in/out links, page level, etc. [Figure: The Web Page Feature Matrix. Columns are features (attributes used to classify web pages); rows are prior instances used to classify future pages.]
Support Vector Machine: Classification • Fraudulent Escrow Website Categorization • A website contains many pages, and a page can contain many images, along with HTML, body text, URLs and anchor text, and site structure. • Important fake escrow classification characteristics: • Requires use of a rich feature set (text, HTML, images, URLs, etc.) • Some feature patterns/trends across fake sites • Some content duplication across fake sites • Web site structure may be important • A single linear SVM cannot handle such information…. • Two solutions: • Ensemble Classifiers • Non-linear Kernel
Support Vector Machines: Classification • Fraudulent Escrow Website Categorization • Ensemble Classifiers • Also referred to as voting based techniques. • Use multiple SVMs to distribute complex features. • This is called a feature based ensemble. • Each SVM classifier is an “expert” on one feature category. [Figure: The Web Page Feature Matrix, with its feature columns split across four classifiers: Body Text SVM, HTML SVM, URL SVM, Image SVM.]
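The voting step of a feature based ensemble can be sketched as below. The four "experts" here are hypothetical threshold rules standing in for trained SVMs, the per-category scores are made up, and simple majority voting is an assumption about how the votes are combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most experts."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(experts, page):
    """Ask each feature-category expert for a label, then take the majority."""
    votes = [predict(page[category]) for category, predict in experts.items()]
    return majority_vote(votes)


def threshold_expert(cutoff):
    """Stand-in for a trained SVM: labels a page 'fake' above a score cutoff."""
    return lambda score: "fake" if score > cutoff else "real"


# One expert per feature category (hypothetical cutoffs).
experts = {
    "body_text": threshold_expert(0.5),
    "html": threshold_expert(0.5),
    "url": threshold_expert(0.5),
    "image": threshold_expert(0.5),
}

# Hypothetical per-category scores for one web page.
page = {"body_text": 0.9, "html": 0.7, "url": 0.2, "image": 0.8}
label = ensemble_predict(experts, page)
```

Because each expert only sees its own feature category, the one-to-many relationship between pages and images never has to be squeezed into a single feature vector.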
Support Vector Machines: Classification • Fraudulent Escrow Website Categorization • Nonlinear kernel • We can define our own kernel function. • Using this function, we can compute the similarity score between every pair of pages. • The resulting similarity matrix can then be input into a linear SVM. • Notice that the features are now the similarity scores for the pages. [Figure: The Web Page Feature Matrix is transformed by the Kernel Function into a page-by-page similarity matrix.]
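A minimal sketch of building such a similarity matrix follows. Cosine similarity is used purely as a placeholder kernel (it is not the Escrow Kernel), and the page vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Placeholder kernel: cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(pages, kernel):
    """Entry [i][j] is the kernel score between page i and page j."""
    return [[kernel(p, q) for q in pages] for p in pages]


# Hypothetical feature vectors for three pages.
pages = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
K = similarity_matrix(pages, cosine_similarity)
```

Row i of `K` is the new feature vector for page i: one similarity score per page, which is what gets passed to the linear SVM.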
Support Vector Machine: Classification • Fraudulent Escrow Website Categorization • An example kernel called “Escrow Kernel” • This kernel is customized to handle fraudulent escrow pages. • It considers the page structure, average page-site similarity, and max page-site similarity. • The Escrow Kernel is defined as follows:
Support Vector Machine: Classification • Fraudulent Escrow Website Categorization • Experimental Design • 50 bootstrap instances • Randomly select 50 real escrow sites and 50 fake web sites in each instance. • Use all the web pages from the selected 100 sites as the instances. • For each instance, use 10-fold CV for page categorization. • 90% of pages used for training, 10% for testing in each fold. • Compare the different feature categories discussed, as well as the use of all features with the ensemble and kernel approaches.
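The site-sampling step of this design might look like the sketch below. The site names, page counts, pool sizes, and the choice to sample sites without replacement within a run are all assumptions made for illustration.

```python
import random


def sample_bootstrap_instance(real_sites, fake_sites, n_per_class, rng):
    """Pick n_per_class sites of each kind and return their labeled pages."""
    chosen_real = rng.sample(sorted(real_sites), n_per_class)
    chosen_fake = rng.sample(sorted(fake_sites), n_per_class)
    pages = [(page, "real") for site in chosen_real for page in real_sites[site]]
    pages += [(page, "fake") for site in chosen_fake for page in fake_sites[site]]
    return pages


# Hypothetical pools: site name -> list of page identifiers.
real_sites = {f"real{i}": [f"real{i}/p{j}" for j in range(3)] for i in range(60)}
fake_sites = {f"fake{i}": [f"fake{i}/p{j}" for j in range(2)] for i in range(80)}

rng = random.Random(0)
runs = [sample_bootstrap_instance(real_sites, fake_sites, 50, rng)
        for _ in range(50)]
```

Each element of `runs` is one bootstrap instance; its pages would then be split into 10 folds for page-level categorization.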
Support Vector Machine • Fraudulent Escrow Website Categorization • Experimental Results (Page level) • The linear kernel outperformed the escrow kernel on the text and HTML features. • The escrow kernel outperformed linear SVM on all other feature sets. • Both the ensemble and the all-feature kernel outperformed the use of individual feature categories. [Table: Average classification accuracy (%) across 50 bootstrap runs; values not preserved. Table note: *Linear Ensemble with 4 SVM Classifiers.]
Support Vector Machines: Classification • Style Categorization Revisited • Ensemble Classifiers • Can also be used across instances. • Use multiple SVMs to distribute complex classes. • This is called an instance or class based ensemble. • Each SVM classifier is an “expert” on one class. • Could be useful for the style categorization scalability problem. [Figure: Identity Feature Matrix, with its instance rows split across per-identity classifiers (ID 1 SVM, ..., ID 3 SVM).]
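One way to combine per-class experts is to let each score the instance and pick the highest scorer. The slide does not say how the experts' outputs are combined, so this arg-max rule is an assumption; the experts below are hypothetical distance-based stand-ins, not trained SVMs.

```python
def make_expert(centroid):
    """Stand-in for a one-class 'expert': higher score = closer to its class."""
    def score(x):
        return -sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
    return score


def class_ensemble_predict(class_experts, instance):
    """Return the class whose expert gives the instance the highest score."""
    scores = {label: score(instance) for label, score in class_experts.items()}
    return max(scores, key=scores.get)


# One expert per author identity (hypothetical centroids in a 2-D feature space).
class_experts = {
    "ID 1": make_expert((0.0, 0.0)),
    "ID 2": make_expert((5.0, 5.0)),
    "ID 3": make_expert((0.0, 5.0)),
}

predicted = class_ensemble_predict(class_experts, (0.5, 4.0))
```

Adding a new author only requires adding one more expert, which is the scalability argument made on this slide.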
Support Vector Machine: Classification • Experimental Results: Stylometric Identification using SVM and Ensemble • The class-based ensemble outperformed the single SVM on three of four data sets, the exception being the Java Programming Forum. • Generally the performance gap widened as the number of classes increased. [Figure: Classification Accuracy (%); values not preserved.]
Support Vector Machine: Classification • Kernel Function Examples • In both the examples on the right, no linear separating hyperplane is possible in the original feature space. • The top one uses the following second order monomials as features: • The bottom one shows how a 3rd degree polynomial kernel can be used.
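The monomial formula from this slide was not preserved. A common textbook choice for 2-D inputs is the map (x1², √2·x1·x2, x2²), which is assumed here. The sketch checks numerically that the dot product in that mapped space equals the degree-2 polynomial kernel (x·y)² computed in the original space.

```python
import math


def phi(x):
    """Assumed second-order monomial map for a 2-D input (textbook form)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel, computed without mapping."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
mapped = dot(phi(x), phi(y))   # dot product in the 3-D monomial space
direct = poly2_kernel(x, y)    # same quantity from the 2-D inputs
```

This is the point of a kernel function: the SVM gets the benefit of the higher-dimensional features while only ever computing `direct`.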
Support Vector Machine: Classification • Popular Non-linear Kernel Functions • Polynomial Kernels • Gaussian Radial Basis Function (RBF) Kernels • Sigmoidal Kernels • Tree Kernels • Graph Kernels • Always be careful when designing a kernel • A poorly designed kernel can often reduce performance • The kernel should be designed such that the similarity scores or structure created by the transformation places related instances in a manner separable from unrelated instances. • Garbage in – garbage out • Live by the kernel….die by the kernel... • ***Insert preferred idiom here***
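The first three kernels in the list have short closed forms, sketched below. The parameter names and default values (`degree`, `coef0`, `gamma`) are illustrative assumptions; tree and graph kernels are structure-specific and are not shown.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """(x.y + coef0) ** degree"""
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF: exp(-gamma * ||x - y||^2); equals 1 when x == y."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    """tanh(gamma * x.y + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)


a, b = (1.0, 2.0), (3.0, 4.0)
scores = {
    "polynomial": polynomial_kernel(a, b, degree=2),
    "rbf": rbf_kernel(a, b),
    "sigmoid": sigmoid_kernel(a, b),
}
```

Each function returns a similarity score for a pair of instances, so any of them could fill the similarity matrix that is passed to the SVM.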
Support Vector Machine: Feature Selection • Most machine learning algorithms can also be used for feature selection. • Trained classifiers assign each feature a weight. • This can be used as an indicator of its effectiveness or importance. • For example, decision tree models (DTMs) have been used a lot. • Similarly, SVM is also highly effective. • Iteratively decrease the feature space by only selecting features over a threshold weight or the n best features. [Figure: Feature Set, then SVM, then SVM Weights, then Selected Features.]
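The selection step can be sketched as below, given weights that a trained linear SVM would supply. The weights here are made up, and ranking by absolute value is an assumption (a strongly negative weight is treated as just as informative as a strongly positive one). Retraining between rounds, which makes the procedure iterative, is not shown.

```python
def select_top_features(weights, n_best=None, threshold=None):
    """Rank features by |weight|, then keep those over a threshold and/or the n best."""
    ranked = sorted(weights, key=lambda f: abs(weights[f]), reverse=True)
    if threshold is not None:
        ranked = [f for f in ranked if abs(weights[f]) >= threshold]
    if n_best is not None:
        ranked = ranked[:n_best]
    return ranked


# Hypothetical weights for five word features in a sentiment task.
weights = {"great": 1.8, "the": 0.05, "awful": -2.1, "camera": 0.3, "poor": -1.2}

top3 = select_top_features(weights, n_best=3)
over_threshold = select_top_features(weights, threshold=0.25)
```

In an iterative setup, the classifier would be retrained on the surviving features and the selection repeated until the feature set is small enough.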
Support Vector Machine: Feature Selection • Sentiment Categorization • 2,000 movie review test bed • Performed 10-fold CV and 50 instances with a 1,900-100 review split. • Used SVM to test sentiment polarity classification performance (positive vs. negative) • Compared a no-feature-selection baseline with feature selection using information gain (IG), genetic algorithm (GA), and SVM weights (SVMW). • SVMW performed well, significantly outperforming the baseline and achieving the best overall accuracy while using the smallest set of features.
Support Vector Machine: Regression • SVM regression is designed to handle continuous data predictions. • Useful for problems where the classes lie along a continuum instead of discrete classes. • Stock Prediction • Predicting the impact a news story will have on a company’s stock price. • Sentiment Categorization • Differentiating 1,2,3,4, and 5 star movie and product reviews. • Often the difference between a 1 and 2 star review is very subtle. • Being able to make more precise predictions can be useful here.
Principal Component Analysis: Background • PCA is a popular dimensionality reduction technique • Has been around since the early 1900s • Still used a lot for text and image processing • Idea is to project data into a lower-dimensional feature space, where variables are transformed into a smaller set of principal components that account for the important variance in the feature matrix. • Used a lot for: • Data preprocessing/filtering • Feature selection/reduction • Classification and clustering • Visualization
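The projection idea can be sketched in plain Python. This version finds only the first principal component, using power iteration on the covariance matrix; that method, the function names, and the toy data are assumptions for illustration, and a real application would normally use a linear algebra library.

```python
import math


def mean_center(data):
    """Subtract each feature's mean so every column has mean zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]


def covariance_matrix(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def top_component(cov, iterations=200):
    """Power iteration: approximates the eigenvector with the largest eigenvalue.
    Assumes the start vector is not orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Score of each instance on the given principal component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical 2-feature data lying close to the line y = x.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
centered = mean_center(data)
pc1 = top_component(covariance_matrix(centered))
scores = project(centered, pc1)
```

Here two correlated features collapse into one score per instance, which keeps most of the variance while halving the number of dimensions.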
Principal Component Analysis: Background [Figure: 1) The Feature Matrix (instances by features) is transformed by PCA into 2) The Projected Matrix (instances by principal components). Annotation: instances will load heavily on P1.]
Principal Component Analysis: Classification • Use of principal component analysis for authorship and genre analysis of texts using 50 function words and 2D plots. • No authorship structure or clustering using top 3 components, due to lack of feature richness. • Some structure based on education level of author. • Some clustering based on genre: fiction is different from description and argument.
Principal Component Analysis: Classification [Figure: Author PCA Scores (using richer features) alongside Anonymous Message Scores. Author A: 5 messages; Author B: 1 message.]
Principal Component Analysis: Classification • Kernel Functions • Kernel functions can be used with PCA in a manner similar to SVM. • This example shows how a polynomial kernel can be used. • Polynomial PCA has been used a lot for image recognition. [Figure: Kernel Function example.]
Principal Component Analysis: Applications Writeprint Illustration
Principal Component Analysis: Applications Various Writeprint Views [Figure panels: Standard View; Density View; Temporal View; Multidimensional View.]
Principal Component Analysis: Applications Writeprint Category Prints • Writeprints are made using all features, while individual categories can also be used for identification or analysis purposes (category prints). [Figure: category print panels for All Features, Letter Freq., Content Spec., Punctuation, and Word Length, each plotted on X and Y axes.]
Principal Component Analysis: Applications Category Print Views This author has a fairly consistent set of discussion topics, based on the tighter pattern (less variation of content specific features).
Principal Component Analysis: Applications Interpreting Writeprints [Figure: Special Char. Writeprints for Author A, Author B, Author C, and Author D, shown with the Special Char. Eigenvectors.]
Principal Component Analysis: Applications [Figure: Author Writeprints alongside Anonymous Messages. Author A: 10 messages; Author B: 10 messages.]
Principal Component Analysis: Applications • Experimental Results: Stylometric Identification Task • Writeprint outperformed SVM and Ensemble SVM. [Figure: Classification Accuracy (%); values not preserved.]
Principal Component Analysis: Applications The Enron Case • Temporal Writeprint views of the two authors (Author A and Author B) across all features (lexical, syntactic, structural, content-specific, n-grams, etc.). • Each circle denotes a text window that is colored according to the point in time at which it occurred. • The bright green points represent text windows from emails written after the scandal had broken out, while the red points represent text windows from before. • Author B has greater overall feature variation, attributable to a distinct difference in the spatial location of points prior to the scandal as opposed to afterwards. • In contrast, Author A has no such difference, with his newer (green) text points placed directly on top of his older (red) ones. • Consequently, the text of Author B’s emails changed profoundly, while there do not appear to be any major changes for Author A.