A Comparative Study of Kernel Methods for Classification Applications Yan Liu Oct 21, 2003
Introduction • Support Vector Machines • Text classification • Protein classification • Various kernels • Standard kernels • Linear kernels, polynomial kernels, RBF kernels (a minimal sketch follows below) • Other application-oriented kernels • Latent semantic kernels • Fisher kernels, string kernels, etc. • Problem Definition • Rare-class problem (unbalanced data) • Noisy data problem • Multi-label problem
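As a concrete reference for the standard kernels listed above, here is a minimal NumPy sketch of the linear, polynomial, and RBF kernel functions; the degree and gamma values are illustrative choices, not settings used in this study.

```python
# Minimal sketch of the standard kernels (illustrative hyperparameters).
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=2, c=1.0):
    # K(x, y) = (<x, y> + c)^d
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```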
Text Classification • Kernels • Linear kernels • Latent semantic kernels • Problem Focus • Rare-class problem • Multi-label problem • Noisy data problem • Dataset • Reuters-21578 dataset
Data Analysis: Reuters-21578 • The corpus consists of 7,769 training documents and 3,019 test documents, mapped to 90 categories • Rare-class problem (unbalanced data)
Data Analysis: Reuters-21578 • Multi-label problem • Definition: one document belongs to more than one category • The average doc-to-category ratio is 1.271 for the training set (computed as sketched below)
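For clarity, a small sketch of how such an average doc-to-category ratio can be computed; `doc_labels` is a hypothetical stand-in, not the actual Reuters-21578 label mapping.

```python
# Hypothetical label mapping; the real corpus has 7,769 training documents.
doc_labels = {
    "doc1": ["earn"],
    "doc2": ["wheat", "grain"],   # a multi-label document
    "doc3": ["acq"],
}
avg_ratio = sum(len(cats) for cats in doc_labels.values()) / len(doc_labels)
print(f"average doc-to-category ratio: {avg_ratio:.3f}")
```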
Methodology and Schedule • Analyze the properties of the application data and propose conjectures about possible behaviors • Project high-dimensional data to a low-dimensional space (an SVD sketch follows below) • Singular Value Decomposition (SVD) • Reduced-Rank Linear Discriminant Analysis (LDA) • Propose hypotheses • Work on synthetic datasets to test the hypotheses • Generate low-dimensional synthetic data with properties similar to the real data • Test the hypotheses • Map from the synthetic data back to the real application data
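A minimal sketch of the SVD-based projection step, assuming a dense document-by-term matrix; the toy matrix and the target rank k are illustrative, not values from the study.

```python
# Project documents onto the top-k singular directions (the basis of LSI).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 500))            # 100 documents x 500 terms (toy data)

k = 2                                 # target dimensionality
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_low = U[:, :k] * s[:k]              # documents in the k-dim latent space
print(X_low.shape)                    # (100, 2)
```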
Case 1: Multi-label Problem: Reuters-21578 • Conceptually two cases: (1) whole vs. part • Wheat vs. Grain
Case 1: Multi-label Problem: Synthetic Data • Data generation (sketched below) • Gaussian mixture models • 200 data points in total • Class 1: red • Class 2: green • Class 1 & 2: blue • Hypotheses • Linear kernel: predicts everything as class 1 • LSI kernel: hard to say; perhaps similar to the linear kernel? • RBF kernel: may fit the data better than the linear kernel?
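A sketch of the "whole vs. part" setup, with assumed Gaussian parameters and class proportions since the slides do not give the exact generation settings; the LSI kernel has no off-the-shelf scikit-learn analogue, so only the linear and RBF kernels are shown.

```python
# Generate synthetic multi-label data ("whole vs. part") and fit SVMs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(150, 2))   # class 1 only
X12 = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(50, 2))   # class 1 & 2

X = np.vstack([X1, X12])
y = np.r_[np.zeros(150), np.ones(50)]   # binary task: carries label 2 or not

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```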
Case 1: Multi-label Problem: Results on Synthetic Data • Linear kernel results: • Class 1: Prec: 0.985000 Rec: 1.000000 F1: 0.992443 • Class 2: Prec: 0 Rec: 0 F1: 0 • Class 1 & 2: Prec: 0 Rec: 0 • Discussion • The results on Class 1 & 2 depend on the proportion mp • mp = # of multi-label examples / # of training examples • If mp > 0.5, the SVM labels everything positive, so Rec = 1.00 and Prec = mp • If mp < 0.5, it labels everything negative, so Rec = 0 and Prec = 0
Case 1: Multi-label Problem: Results on Synthetic Data • LSI kernel results: • Exactly the same as the linear kernel • Class 1: Prec: 0.985000 Rec: 1.000000 F1: 0.992443 • Class 2: Prec: 0 Rec: 0 F1: 0 • Class 1 & 2: Prec: 0 Rec: 0 • Discussion • On this data, LSI performs the same as the linear kernel • In the real application, LSI might behave differently
Case 1: Multi-label Problem: Results on Synthetic Data • RBF kernel results • Class 1: Prec: 0.985000 Rec: 1.000000 F1: 0.992443 • Class 2: Prec: 0.854167 Rec: 0.512500 F1: 0.640625 • Class 1 & 2: Prec: 0.791667 Rec: 0.493506 • Discussion • The RBF kernel fits the data very well
Case 1: Multi-label Problem: Reuters-21578 • Conceptually two cases: (2) shared concepts • Wheat vs. Soybean
Case 1: Multi-label Problem: Synthetic Data • Data generation • Gaussian mixture models • 200 data points in total • Class 1: red • Class 2: green • Class 1 & 2: blue • Hypotheses • Linear kernel: might work well for this case? • LSI kernel: also might work for this case? • RBF kernel: might overfit?
Case 1: Multi-label Problem: Results on Synthetic Data • Linear kernel results: • Class 1: Prec: 0.918699 Rec: 0.869231 F1: 0.893281 • Class 2: Prec: 0.938462 Rec: 0.938462 F1: 0.938462 • Class 1 & 2: Prec: 0.391304 Rec: 0.300000
Case 1: Multi-label Problem: Results on Synthetic Data • LSI kernel results: • Class 1: Prec: 0.928000 Rec: 0.892308 F1: 0.909804 • Class 2: Prec: 0.938462 Rec: 0.938462 F1: 0.938462 • Class 1 & 2: Prec: 0.440000 Rec: 0.366667
Case 1: Multi-label Problem: Results on Synthetic Data • RBF kernel results: • Class 1: Prec: 0.934426 Rec: 0.876923 F1: 0.904762 • Class 2: Prec: 0.938462 Rec: 0.938462 F1: 0.938462 • Class 1 & 2: Prec: 0.454545 Rec: 0.333333
Case 1: Multi-label Problem: Results on Synthetic Data • Discussion of results • The linear kernel performs reasonably well • The LSI kernel gains over the linear kernel by separating the data in the right direction • The RBF kernel tends to fit the training data very closely
Case 2: Rare-class Problem: Reuters-21578 • CPU vs. Wheat
Case 2: Rare-class Problem: Synthetic Data • Data generation • Gaussian mixture models • 103 data points in total • Class 1: red • Class 2: green • Hypotheses • Both the linear kernel and the LSI kernel seem likely to perform reasonably well • RBF kernel: might overfit?
Case 2: Rare-class Problem: Synthetic Data Results • Results • Question: where is the problem? • [Figure: decision plots for the linear, LSI, and RBF kernels]
Case 2: Rare-class Problem: Synthetic Data Results • Discussion • The problem lies in the SVM classifier rather than the kernel: the SVM maximizes the margin, so the rare class is overwhelmed • Solutions (the first two are sketched below) • Adjust the cost function in the SVM classifier • Tune the decision threshold instead of using the default of 0 • Up-sampling, down-sampling, and ensemble approaches • The analysis for different kernels will be difficult
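A sketch of the first two fixes using scikit-learn: re-weighting the rare class through the SVM cost function, and shifting the decision threshold away from the default of 0. The toy data and the weight and threshold values are illustrative assumptions, not settings from the study.

```python
# Rare-class fixes: class re-weighting and threshold tuning.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # majority class
               rng.normal(3.0, 0.5, (3, 2))])    # rare class (3 examples)
y = np.r_[np.zeros(100), np.ones(3)]

# Fix 1: penalize errors on the rare class more heavily via the cost term.
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 100.0 / 3.0}).fit(X, y)

# Fix 2: tune the decision threshold rather than using the default of 0
# (in practice the threshold would be chosen on held-out data).
scores = clf.decision_function(X)
threshold = -0.5
y_pred = (scores > threshold).astype(int)
print("positives predicted:", int(y_pred.sum()))
```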
Case 3: Noisy-data Problem: Synthetic Data • Data generation • Gaussian mixture models • 200 data points in total • Class 1: red • Class 2: green • Noise data: blue • Hypotheses • The linear kernel tends to be robust to noise • Little change for the LSI kernel, since the transformation is independent of the class labels • RBF kernel: might overfit?
Case 3: Noisy-data Problem: Synthetic Data Results • Results • The linear kernel and LSI kernel are robust to the noise • The RBF kernel tends to overfit • [Figure: decision plots for the linear, LSI, and RBF kernels]
Summary • Multi-label problem • Case 1: whole vs. part • The linear and LSI kernels depend on the data distribution, but can work much better if the category hierarchy is known • RBF seems to work better • Case 2: shared concepts • LSI works a little better than the linear kernel • Rare-class problem • The problem lies in the SVM classifier, most seriously in thresholding • Noisy data • The linear kernel and LSI are robust to noise • RBF might overfit
Next Steps • Work on the real application datasets and test the hypotheses • Reuters-21578 • A subset of RCV-1 • Focus more on the multi-label problem
Protein Family Classification • Kernel selection • Fisher kernels • String kernels • Problem Focus • Rare-class problem • Noisy data problem • Dataset • GPCR family classification dataset
Data Analysis: GPCR Family Classification • The dataset consists of 1,356 sequences in 13 classes; each sequence has exactly one label • Rare-class problem (unbalanced data)
Kernel Methods Revisited • Fisher kernel • Build an HMM for each family • Compute the Fisher score for each parameter of the HMM • Use the scores as features and classify with an SVM with an RBF kernel • String kernels • k-spectrum kernel (sketched below): • features are all possible contiguous subsequences of length k (k = 3, 4) • similar to using N-grams • Mismatch string kernel • an extension of the spectrum kernel that allows mismatches • k = 5, 6
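A minimal sketch of the k-spectrum kernel: map each sequence to counts of its contiguous length-k subsequences and take the inner product of the count vectors. The sequences here are toy examples, not GPCR data.

```python
# k-spectrum kernel: inner product of k-mer count vectors.
from collections import Counter

def spectrum(seq, k):
    """Counts of all contiguous k-mers in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    sx, sy = spectrum(x, k), spectrum(y, k)
    # Counter returns 0 for absent k-mers, so this sums over shared k-mers.
    return sum(cnt * sy[kmer] for kmer, cnt in sx.items())

print(spectrum_kernel("MKVLLA", "MKVLIA", k=3))   # shared k-mers: MKV, KVL
```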
Proposed Kernel: PSA Kernel • Intuition • The kernel defines the similarity between two sequences in the Hilbert feature space • The similarity between two sequences is one of the basic, well-studied problems in bioinformatics • Proposed kernel • K(x, y) is the pairwise sequence alignment score (sketched below)
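A sketch of the proposed PSA kernel: fill the Gram matrix with pairwise global alignment scores. The study uses ClustalW for the alignments; here a small Needleman-Wunsch with assumed match/mismatch/gap scores stands in so the example is self-contained.

```python
# PSA kernel sketch: Gram matrix of pairwise alignment scores.
import numpy as np

def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (assumed scoring scheme)."""
    m, n = len(a), len(b)
    D = np.zeros((m + 1, n + 1))
    D[:, 0] = gap * np.arange(m + 1)
    D[0, :] = gap * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i, j] = max(D[i - 1, j - 1] + s,   # align a[i-1] with b[j-1]
                          D[i - 1, j] + gap,     # gap in b
                          D[i, j - 1] + gap)     # gap in a
    return D[m, n]

seqs = ["MKVLLA", "MKVLIA", "GGHHTT"]   # toy sequences
K = np.array([[align_score(x, y) for y in seqs] for x in seqs])
print(K)   # note: such a matrix is not guaranteed positive semi-definite
```

The final comment is why the ongoing work below includes a proof of positive semi-definiteness: a Gram matrix of raw alignment scores is not automatically a valid kernel.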
Experimental Results and Ongoing Work • Experimental results • Two-way cross-validation • Pairwise sequence alignment using ClustalW • Accuracy of 0.9550 on GPCR family classification over 13 classes, and 0.9834 over classes A-E • The SVM converges very fast • Ongoing work • Proof of positive semi-definiteness • Connection between the string kernel and the Fisher kernel • Experiments on other datasets