A Comparative Study of Kernel Methods for Classification Applications Yan Liu Oct 21, 2003
Introduction • Support Vector Machines • Text classification • Protein classification • Various kernels • Standard kernels • Linear kernels, polynomial kernels, RBF kernels (a minimal sketch follows below) • Other application-oriented kernels • Latent semantic kernels • Fisher kernels, string kernels, etc. • Problem Definition • Rare-class problem (unbalanced data) • Noisy data problem • Multi-label problem
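As a concrete reference for the standard kernels listed above, here is a minimal NumPy sketch of the linear, polynomial, and RBF kernel functions; the degree and gamma values are illustrative choices, not settings used in this study.

```python
# Minimal sketch of the standard kernels (illustrative hyperparameters).
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=2, c=1.0):
    # K(x, y) = (<x, y> + c)^d
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```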
Text Classification • Kernels • Linear kernels • Latent semantic kernels • Problem Focus • Rare-class problem • Multi-label problem • Noisy data problem • Dataset • Reuters-21578 dataset
Data Analysis: Reuters-21578 • The corpus consists of 7,769 training documents and 3,019 test documents, mapped to 90 categories • Rare-class problem (unbalanced data)
Data Analysis: Reuters-21578 • Multi-label problem • Definition: one document belongs to more than one category • The average doc-to-category ratio is 1.271 for the training set (computed as sketched below)
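For clarity, a small sketch of how such an average doc-to-category ratio can be computed; `doc_labels` is a hypothetical stand-in, not the actual Reuters-21578 label mapping.

```python
# Hypothetical label mapping; the real corpus has 7,769 training documents.
doc_labels = {
    "doc1": ["earn"],
    "doc2": ["wheat", "grain"],   # a multi-label document
    "doc3": ["acq"],
}
avg_ratio = sum(len(cats) for cats in doc_labels.values()) / len(doc_labels)
print(f"average doc-to-category ratio: {avg_ratio:.3f}")
```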
Methodology and Schedule • Analyze the properties of the application data and propose conjectures about possible behaviors • Project high-dimensional data to a low-dimensional space (an SVD sketch follows below) • Singular Value Decomposition (SVD) • Reduced-Rank Linear Discriminant Analysis (LDA) • Propose hypotheses • Work on synthetic datasets to test the hypotheses • Generate low-dimensional synthetic data with properties similar to the real data • Test the hypotheses • Map from the synthetic data back to the real application data
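A minimal sketch of the SVD-based projection step, assuming a dense document-by-term matrix; the toy matrix and the target rank k are illustrative, not values from the study.

```python
# Project documents onto the top-k singular directions (the basis of LSI).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 500))            # 100 documents x 500 terms (toy data)

k = 2                                 # target dimensionality
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_low = U[:, :k] * s[:k]              # documents in the k-dim latent space
print(X_low.shape)                    # (100, 2)
```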
Case 1: Multi-label Problem: Reuters-21578 • Conceptually two cases: (1) whole vs. part • Wheat vs. Grain
Case 1: Multi-label Problem: Synthetic Data • Data generation (sketched below) • Gaussian mixture models • 200 data points in total • Class 1: red • Class 2: green • Class 1 & 2: blue • Hypotheses • Linear kernel: predicts everything as class 1 • LSI kernel: hard to say; perhaps similar to the linear kernel? • RBF kernel: may fit the data better than the linear kernel?
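A sketch of the "whole vs. part" setup, with assumed Gaussian parameters and class proportions since the slides do not give the exact generation settings; the LSI kernel has no off-the-shelf scikit-learn analogue, so only the linear and RBF kernels are shown.

```python
# Generate synthetic multi-label data ("whole vs. part") and fit SVMs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(150, 2))   # class 1 only
X12 = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(50, 2))   # class 1 & 2

X = np.vstack([X1, X12])
y = np.r_[np.zeros(150), np.ones(50)]   # binary task: carries label 2 or not

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```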
Case 1: Multi-label Problem: Results on Synthetic Data • Linear kernel results: • Class 1: Prec: 0.985000 Rec: 1.000000 F1: 0.992443 • Class 2: Prec: 0 Rec: 0 F1: 0 • Class 1 & 2: Prec: 0 Rec: 0 • Discussion • The results on Class 1 & 2 depend on the proportion mp • mp = # of multi-label examples / # of training examples • If mp > 0.5, the SVM labels everything positive, so Rec = 1.00 and Prec = mp • If mp < 0.5, it labels everything negative, so Rec = 0 and Prec = 0
Case 1: Multi-label Problem: Results on Synthetic Data • LSI kernel results: • Exactly the same as the linear kernel • Class 1: Prec: 0.985000 Rec: 1.000000 F1: 0.992443 • Class 2: Prec: 0 Rec: 0 F1: 0 • Class 1 & 2: Prec: 0 Rec: 0 • Discussion • On this data, LSI performs the same as the linear kernel • In the real application, LSI might behave differently
Case 1: Multi-label Problem: Results on Synthetic Data • RBF kernel results • Class 1: Prec: 0.985000 Rec: 1.000000 F1: 0.992443 • Class 2: Prec: 0.854167 Rec: 0.512500 F1: 0.640625 • Class 1 & 2: Prec: 0.791667 Rec: 0.493506 • Discussion • The RBF kernel fits the data very well
Case 1: Multi-label Problem: Reuters-21578 • Conceptually two cases: (2) shared concepts • Wheat vs. Soybean
Case 1: Multi-label Problem: Synthetic Data • Data generation • Gaussian mixture models • 200 data points in total • Class 1: red • Class 2: green • Class 1 & 2: blue • Hypotheses • Linear kernel: might work well for this case? • LSI kernel: also might work for this case? • RBF kernel: might overfit?
Case 1: Multi-label Problem: Results on Synthetic Data • Linear kernel results: • Class 1: Prec: 0.918699 Rec: 0.869231 F1: 0.893281 • Class 2: Prec: 0.938462 Rec: 0.938462 F1: 0.938462 • Class 1 & 2: Prec: 0.391304 Rec: 0.300000
Case 1: Multi-label Problem: Results on Synthetic Data • LSI kernel results: • Class 1: Prec: 0.928000 Rec: 0.892308 F1: 0.909804 • Class 2: Prec: 0.938462 Rec: 0.938462 F1: 0.938462 • Class 1 & 2: Prec: 0.440000 Rec: 0.366667
Case 1: Multi-label Problem: Results on Synthetic Data • RBF kernel results: • Class 1: Prec: 0.934426 Rec: 0.876923 F1: 0.904762 • Class 2: Prec: 0.938462 Rec: 0.938462 F1: 0.938462 • Class 1 & 2: Prec: 0.454545 Rec: 0.333333
Case 1: Multi-label Problem: Results on Synthetic Data • Discussion of results • The linear kernel performs reasonably well • The LSI kernel gains over the linear kernel by separating the data in the right direction • The RBF kernel tends to fit the training data very closely
Case 2: Rare-class Problem: Reuters-21578 • CPU vs. Wheat
Case 2: Rare-class Problem: Synthetic Data • Data generation • Gaussian mixture models • 103 data points in total • Class 1: red • Class 2: green • Hypotheses • Both the linear kernel and the LSI kernel seem likely to perform reasonably well • RBF kernel: might overfit?
Case 2: Rare-class Problem: Synthetic Data Results • Results • Question: where is the problem? • [Figure: decision plots for the linear, LSI, and RBF kernels]
Case 2: Rare-class Problem: Synthetic Data Results • Discussion • The problem lies in the SVM classifier rather than the kernel: the SVM maximizes the margin, so the rare class is overwhelmed • Solutions (the first two are sketched below) • Adjust the cost function in the SVM classifier • Tune the decision threshold instead of using the default of 0 • Up-sampling, down-sampling, and ensemble approaches • The analysis for different kernels will be difficult
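A sketch of the first two fixes using scikit-learn: re-weighting the rare class through the SVM cost function, and shifting the decision threshold away from the default of 0. The toy data and the weight and threshold values are illustrative assumptions, not settings from the study.

```python
# Rare-class fixes: class re-weighting and threshold tuning.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # majority class
               rng.normal(3.0, 0.5, (3, 2))])    # rare class (3 examples)
y = np.r_[np.zeros(100), np.ones(3)]

# Fix 1: penalize errors on the rare class more heavily via the cost term.
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 100.0 / 3.0}).fit(X, y)

# Fix 2: tune the decision threshold rather than using the default of 0
# (in practice the threshold would be chosen on held-out data).
scores = clf.decision_function(X)
threshold = -0.5
y_pred = (scores > threshold).astype(int)
print("positives predicted:", int(y_pred.sum()))
```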
Case 3: Noisy-data Problem: Synthetic Data • Data generation • Gaussian mixture models • 200 data points in total • Class 1: red • Class 2: green • Noise data: blue • Hypotheses • The linear kernel tends to be robust to noise • Little change for the LSI kernel, since the transformation is independent of the class labels • RBF kernel: might overfit?
Case 3: Noisy-data Problem: Synthetic Data Results • Results • The linear kernel and LSI kernel are robust to the noise • The RBF kernel tends to overfit • [Figure: decision plots for the linear, LSI, and RBF kernels]
Summary • Multi-label problem • Case 1: whole vs. part • The linear and LSI kernels depend on the data distribution, but can work much better if the category hierarchy is known • RBF seems to work better • Case 2: shared concepts • LSI works a little better than the linear kernel • Rare-class problem • The problem lies in the SVM classifier, most seriously in thresholding • Noisy data • The linear kernel and LSI are robust to noise • RBF might overfit
Next Steps • Work on the real application datasets and test the hypotheses • Reuters-21578 • A subset of RCV-1 • Focus more on the multi-label problem
Protein Family Classification • Kernel selection • Fisher kernels • String kernels • Problem Focus • Rare-class problem • Noisy data problem • Dataset • GPCR family classification dataset
Data Analysis: GPCR Family Classification • The dataset consists of 1,356 sequences in 13 classes; each sequence has exactly one label • Rare-class problem (unbalanced data)
Kernel Methods Revisited • Fisher kernel • Build an HMM for each family • Compute the Fisher score for each parameter of the HMM • Use the scores as features and classify with an SVM with an RBF kernel • String kernels • k-spectrum kernel (sketched below): • features are all possible contiguous subsequences of length k (k = 3, 4) • similar to using N-grams • Mismatch string kernel • an extension of the spectrum kernel that allows mismatches • k = 5, 6
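A minimal sketch of the k-spectrum kernel: map each sequence to counts of its contiguous length-k subsequences and take the inner product of the count vectors. The sequences here are toy examples, not GPCR data.

```python
# k-spectrum kernel: inner product of k-mer count vectors.
from collections import Counter

def spectrum(seq, k):
    """Counts of all contiguous k-mers in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    sx, sy = spectrum(x, k), spectrum(y, k)
    # Counter returns 0 for absent k-mers, so this sums over shared k-mers.
    return sum(cnt * sy[kmer] for kmer, cnt in sx.items())

print(spectrum_kernel("MKVLLA", "MKVLIA", k=3))   # shared k-mers: MKV, KVL
```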
Proposed Kernel: PSA Kernel • Intuition • The kernel defines the similarity between two sequences in the Hilbert feature space • The similarity between two sequences is one of the basic, well-studied problems in bioinformatics • Proposed kernel • K(x, y) is the pairwise sequence alignment score (sketched below)
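A sketch of the proposed PSA kernel: fill the Gram matrix with pairwise global alignment scores. The study uses ClustalW for the alignments; here a small Needleman-Wunsch with assumed match/mismatch/gap scores stands in so the example is self-contained.

```python
# PSA kernel sketch: Gram matrix of pairwise alignment scores.
import numpy as np

def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (assumed scoring scheme)."""
    m, n = len(a), len(b)
    D = np.zeros((m + 1, n + 1))
    D[:, 0] = gap * np.arange(m + 1)
    D[0, :] = gap * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i, j] = max(D[i - 1, j - 1] + s,   # align a[i-1] with b[j-1]
                          D[i - 1, j] + gap,     # gap in b
                          D[i, j - 1] + gap)     # gap in a
    return D[m, n]

seqs = ["MKVLLA", "MKVLIA", "GGHHTT"]   # toy sequences
K = np.array([[align_score(x, y) for y in seqs] for x in seqs])
print(K)   # note: such a matrix is not guaranteed positive semi-definite
```

The final comment is why the ongoing work below includes a proof of positive semi-definiteness: a Gram matrix of raw alignment scores is not automatically a valid kernel.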
Experimental Results and Ongoing Work • Experimental results • Two-way cross-validation • Pairwise sequence alignment using ClustalW • Accuracy of 0.9550 on GPCR family classification over 13 classes, and 0.9834 over classes A-E • The SVM converges very fast • Ongoing work • Proof of positive semi-definiteness • Connection between the string kernel and the Fisher kernel • Experiments on other datasets