  1. A Kernel Approach for Learning From Almost Orthogonal Patterns* CIS 525 Class Presentation Professor: Slobodan Vucetic Presenter: Yilian Qin * B. Schölkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.

  2. Presentation Outline • Introduction • Motivation • A brief review of SVM for linearly separable patterns • Kernel approach for SVM • Empirical kernel map • Problem: almost orthogonal patterns in the feature space • An example • Situations leading to almost orthogonal patterns • Methods to reduce large diagonals of the Gram matrix • Gram matrix transformation • An approximate approach based on statistics • Experiments • Artificial data (string classification, microarray data with noise, hidden variable problem) • Real data (thrombin binding, lymphoma classification, protein family classification) • Conclusions • Comments

  3. Introduction

  4. Motivation • Support vector machine (SVM) • A powerful method for classification (or regression), with accuracy comparable to neural networks • Exploits kernel functions to separate patterns in a high-dimensional feature space • The information the SVM needs about the training data is stored in the Gram matrix (kernel matrix) • The problem: the SVM does not perform well if the Gram matrix has large diagonal values relative to its off-diagonal entries

  5. A Brief Review of SVM • [Figure: linearly separable + and − patterns with a separating hyperplane; the margin depends only on the closest points] • For linearly separable patterns, to maximize the margin: Minimize: (1/2)||w||^2 • Constraints: y_i(w·x_i + b) ≥ 1 for every training pair (x_i, y_i)
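
A minimal sketch, not part of the original slides, of this linearly separable case, assuming NumPy and scikit-learn are available; the hard margin is approximated by a very large C, and the data points are made up for illustration.

```python
# Hard-margin linear SVM on separable data, approximated with a large C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],         # positive class
              [-2.0, -2.0], [-2.5, -1.5], [-3.0, -2.5]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)     # very large C mimics the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors (the closest points):\n", clf.support_vectors_)
```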

  6. Kernel Approach for SVM (1/3) • For linearly non-separable patterns, a nonlinear mapping Φ: x → H maps the patterns into a new feature space H of higher dimension (for example, the XOR problem) • SVM in the new feature space: minimize (1/2)||w||^2 subject to y_i(w·Φ(x_i) + b) ≥ 1 • Solving this minimization problem requires: 1) the explicit form of Φ, and 2) inner products in the high-dimensional space H • The kernel trick: both are simplified away by a wise choice of kernel function with the property k(x_i, x_j) = Φ(x_i)·Φ(x_j)
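
A small NumPy check, not from the slides, of the kernel trick for the homogeneous degree-2 polynomial kernel, whose explicit feature map in R^2 is known; the XOR corner points are used only to verify the identity k(x, y) = Φ(x)·Φ(y).

```python
# Kernel trick check: (x·y)^2 equals the inner product of the explicit
# degree-2 feature maps, so the map itself never has to be computed.
import numpy as np

def k_poly2(x, y):
    return float(np.dot(x, y)) ** 2

def phi2(x):
    # Explicit feature map of the homogeneous degree-2 polynomial kernel in R^2
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # XOR corners

for xi in X:
    for xj in X:
        assert np.isclose(k_poly2(xi, xj), np.dot(phi2(xi), phi2(xj)))
print("k(x, y) = phi(x)·phi(y) verified on all pairs")
```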

  7. Kernel Approach for SVM (2/3) • Transform the problem with the kernel method • Expand w in the new feature space: w = Σ_i a_i Φ(x_i) = [Φ(x)] a, where [Φ(x)] = [Φ(x_1), Φ(x_2), …, Φ(x_m)] and a = [a_1, a_2, …, a_m]^T • Gram matrix: K = [K_ij], where K_ij = Φ(x_i)·Φ(x_j) = k(x_i, x_j) (symmetric!) • The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (a sufficient condition for an optimal solution to exist: K is positive definite) • The constraints: y_i{w^T Φ(x_i) + b} = y_i{a^T [Φ(x)]^T Φ(x_i) + b} = y_i{a^T K_i + b} ≥ 1, where K_i is the i-th column of K • The transformed problem: minimize a^T K a subject to y_i(a^T K_i + b) ≥ 1
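
An illustrative NumPy sketch, not from the slides, of the quantities on this slide: the Gram matrix K, the squared objective a^T K a, and the constraint values y_i(a^T K_i + b), evaluated for an arbitrary, non-optimized a and b.

```python
# Gram-matrix form of the SVM objective and constraints,
# evaluated for an arbitrary (non-optimal) coefficient vector a.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                # 6 training patterns in R^3
y = np.array([1, 1, 1, -1, -1, -1])

def kernel(xi, xj):                        # linear kernel, for illustration
    return float(np.dot(xi, xj))

m = len(X)
K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

a = rng.normal(size=m)                     # expansion coefficients of w
b = 0.0

print("||w||^2 = a^T K a =", a @ K @ a)
print("constraint values y_i(a^T K_i + b):", y * (K @ a + b))  # want >= 1
```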

  8. Kernel Approach for SVM (3/3) • To predict new data with a trained SVM: f(x) = sign(Σ_{i=1}^{m} a_i k(x_i, x) + b), where a and b are the optimal solution obtained from the training data and m is the number of training points • The explicit form of k(x_i, x_j) is therefore required to predict new data
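
A short NumPy sketch, not from the slides, of the prediction step; the coefficients a and b are placeholders standing in for the optimal solution from training, and the point is simply that the explicit kernel k must be evaluated against every training pattern.

```python
# Prediction with a trained kernel SVM: f(x) = sign(sum_i a_i k(x_i, x) + b).
import numpy as np

def kernel(xi, xj):
    return float(np.dot(xi, xj))           # linear kernel, for illustration

def predict(x_new, X_train, a, b):
    score = sum(a_i * kernel(x_i, x_new) for a_i, x_i in zip(a, X_train)) + b
    return 1 if score >= 0 else -1

# a and b would come from solving the training problem on the previous slide;
# here they are placeholder values, just to show the shape of the computation.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
a = np.array([0.5, 0.5, -0.5, -0.5])
b = 0.0
print(predict(np.array([0.8, 0.3]), X_train, a, b))
```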

  9. Empirical Kernel Mapping • Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space R^m • Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), …, k(x_i, x_m)]^T = K_i • The SVM in R^m: minimize ||w||^2 subject to y_i(w·Φ_m(x_i) + b) ≥ 1 • The new Gram matrix K_m associated with Φ_m(x): K_m = [K_m,ij], where K_m,ij = Φ_m(x_i)·Φ_m(x_j) = K_i·K_j = K_i^T K_j, i.e. K_m = K^T K = K K^T • Advantage of the empirical kernel map: K_m is positive (semi)definite, since K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U (K is symmetric, U is a unitary matrix, D is diagonal) • This satisfies the sufficient condition of the minimization problem above
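
A NumPy check, not from the slides, of the empirical kernel map: the map of x_i is simply the i-th column K_i, so the new Gram matrix is K_m = K K^T, whose eigenvalues are the squares of K's eigenvalues and hence never negative, even when K itself is indefinite.

```python
# Empirical kernel map: phi_m(x_i) = K_i, so K_m = K K^T is positive
# semidefinite even if the original K is not.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))
K = X @ X.T - 0.5 * np.eye(5)      # symmetric, but with a negative eigenvalue

Phi_m = K.T                        # row i of Phi_m is the column K_i
K_m = Phi_m @ Phi_m.T              # equals K K^T (= K^T K, since K is symmetric)

print("eigenvalues of K:  ", np.round(np.linalg.eigvalsh(K), 3))
print("eigenvalues of K_m:", np.round(np.linalg.eigvalsh(K_m), 3))  # all >= 0
```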

  10. The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

  11. An Example of Almost Orthogonal Patterns • The Gram matrix is computed with the linear kernel k(x_i, x_j) = x_i·x_j • The training dataset (shown on the slide) consists of almost orthogonal patterns, so the Gram matrix has large diagonals • w is the solution found by the standard SVM • Observation: each large entry of w corresponds to a column of X with only one large entry, so w acts as a lookup table and the SVM will not generalize well • A better solution is shown on the slide
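
A NumPy sketch, not from the slides, of why sparse patterns behave this way: random sparse binary vectors are almost orthogonal, so the linear-kernel Gram matrix has diagonal entries far larger than the off-diagonal ones. The patterns here are synthetic, not the dataset from the slide.

```python
# Sparse binary patterns are almost orthogonal: x_i·x_i is much larger
# than x_i·x_j, so the linear-kernel Gram matrix has large diagonals.
import numpy as np

rng = np.random.default_rng(2)
m, d = 8, 200
X = (rng.random((m, d)) < 0.05).astype(float)    # ~5% of entries are non-zero

K = X @ X.T                                      # linear-kernel Gram matrix
off_diagonal = K[~np.eye(m, dtype=bool)]

print("mean diagonal entry:    ", np.diag(K).mean())
print("mean off-diagonal entry:", off_diagonal.mean())   # much smaller
```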

  12. Situations Leading to Almost Orthogonal Patterns • Sparsity of the patterns in the new feature space, e.g. x = [0, 0, 0, 1, 0, 0, 1, 0]^T and y = [0, 1, 1, 0, 0, 0, 0, 0]^T, for which x·x ≈ y·y >> x·y (large diagonals in the Gram matrix) • Some choices of kernel function may produce sparsity in the new feature space, e.g. string kernels (Watkins, 2000, and others) • Polynomial kernel k(x_i, x_j) = (x_i·x_j)^d with large order d: if x_i·x_i > x_i·x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) even for moderately large d, because the ratio between them is raised to the power d

  13. Methods to Reduce the Large Diagonals of Gram Matrices

  14. Gram Matrix Transformation (1/2) • For a symmetric, positive definite Gram matrix K (or K_m): K = U^T D U, where U is a unitary matrix and D is diagonal • Define f(K) = U^T f(D) U with f(D)_ii = f(D_ii), i.e. the function f operates on the eigenvalues λ_i of K • f(K) should preserve the positive definiteness of the Gram matrix • A sample procedure for the Gram matrix transformation: (optional) compute the positive definite matrix A = sqrt(K); suppress the large diagonals of A to obtain a symmetric A', i.e. transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]; compute the positive definite matrix K' = (A')^2
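
A NumPy sketch, not from the slides, of one way to carry out this sample procedure: eigendecompose K, take the matrix square root, shrink the diagonal of the result, and square it again, so that K' is symmetric positive semidefinite by construction. The shrinkage factor 0.5 is an arbitrary illustrative choice, not the paper's prescription.

```python
# Gram-matrix transformation: K -> A = sqrt(K), suppress A's diagonal to get
# a symmetric A', then K' = (A')^2, which is automatically positive semidefinite.
import numpy as np

def sqrtm_psd(K):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    evals, U = np.linalg.eigh(K)
    return U @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ U.T

def suppress_diagonal(K, shrink=0.5):
    """Illustrative diagonal suppression; 'shrink' is not the paper's recipe."""
    A = sqrtm_psd(K)
    A_prime = A - shrink * np.diag(np.diag(A))   # still symmetric
    return A_prime @ A_prime                     # K' = (A')^2

rng = np.random.default_rng(3)
X = (rng.random((6, 200)) < 0.1).astype(float)   # sparse patterns, large diagonal
K = X @ X.T
K_prime = suppress_diagonal(K)

ratio = lambda M: np.diag(M).mean() / M[~np.eye(len(M), dtype=bool)].mean()
print("diag/off-diag ratio before:", round(ratio(K), 2))
print("diag/off-diag ratio after: ", round(ratio(K_prime), 2))
```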

  15. Gram Matrix Transformation (2/2) • [Diagram: k(x_i, x_j) = Φ(x_i)·Φ(x_j) defines K; the transformation K' = f(K) implicitly defines a new map Φ'(x) and a new kernel k'(x_i, x_j) = Φ'(x_i)·Φ'(x_j)] • Effect of the matrix transformation: the explicit form of the new kernel function k' is not available, yet k' is needed when the trained SVM predicts test data • A solution: include all test data in K before the matrix transformation K → K', i.e. the test data have to be known at training time • a' and b' are then obtained from the portion of K' corresponding to the training data, and if x_i was used in computing K', its prediction can simply use the column K'_i, i = 1, 2, …, m+n, where m is the number of training points and n the number of test points
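
A sketch, not from the slides, of this transductive setup using scikit-learn's precomputed-kernel SVM: the Gram matrix over training and test points is transformed first, then the training block is used for fitting and the test-versus-training block for prediction. The data, labels, and shrinkage factor are all synthetic and purely illustrative.

```python
# Transductive use of a transformed Gram matrix: build K over train + test,
# transform it, train on the training block, predict from the test rows.
import numpy as np
from sklearn.svm import SVC

def transform_gram(K, shrink=0.5):
    """Suppress large diagonals: K -> A = sqrt(K), shrink diag(A), K' = (A')^2."""
    evals, U = np.linalg.eigh(K)
    A = U @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ U.T
    A -= shrink * np.diag(np.diag(A))          # illustrative shrinkage choice
    return A @ A

rng = np.random.default_rng(4)
X_all = (rng.random((12, 200)) < 0.1).astype(float)   # 8 train + 4 test patterns
y_train = np.array([1, 1, 1, 1, -1, -1, -1, -1])
m = 8                                                 # number of training points

K_prime = transform_gram(X_all @ X_all.T)  # test data included before transforming

clf = SVC(kernel="precomputed")
clf.fit(K_prime[:m, :m], y_train)          # training block of K'
print(clf.predict(K_prime[m:, :m]))        # test rows against training columns
```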

  16. An Approximate Approach Based on Statistics • In principle the empirical kernel map Φ_{m+n}(x), built over both training and test points, should be used to calculate the Gram matrix • Assuming the dataset size is large, the extra n coordinates contributed by the test points behave statistically like the first m, so they change the geometry of the mapped training data very little • Therefore the SVM can simply be trained with the empirical map on the training set, Φ_m(x), instead of Φ_{m+n}(x)

  17. Experiment Results

  18. Artificial Data (1/3) • String classification • String kernel function (Watkins, 2000, and others) • Sub-polynomial kernel k(x, y) = [Φ(x)·Φ(y)]^p, 0 < p < 1; for sufficiently small p, the large diagonals of K can be suppressed • 50 strings (25 for training, 25 for testing), 20 trials
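
A NumPy sketch, not from the slides, of the sub-polynomial idea on a nonnegative Gram matrix: raising the kernel values to a power 0 < p < 1 compresses the large diagonal entries relative to the off-diagonal ones. (The transformed matrix need not remain positive definite, which is one reason it is paired with the empirical kernel map.)

```python
# Sub-polynomial kernel: k'(x, y) = k(x, y)^p with 0 < p < 1 reduces
# the diagonal dominance of the Gram matrix.
import numpy as np

rng = np.random.default_rng(5)
X = (rng.random((8, 200)) < 0.1).astype(float)   # sparse binary patterns
K = X @ X.T                                      # nonnegative linear-kernel Gram

def diag_dominance(M):
    off = M[~np.eye(len(M), dtype=bool)]
    return np.diag(M).mean() / max(off.mean(), 1e-12)

for p in (1.0, 0.5, 0.2):
    print(f"p = {p}: diag/off-diag ratio = {diag_dominance(K ** p):.2f}")
```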

  19. Artificial Data (2/3) • Microarray data with noise (Alon et al., 1999) • 62 instances (22 positive, 40 negative), 2000 features in the original data • 10,000 noise features were added (each non-zero with probability 1%) • Error rate of the SVM without added noise: 0.18 ± 0.15

  20. Artificial Data (3/3) • Hidden variable problem • 10 hidden variables (attributes), 10 additional attributes which are nonlinear functions of the 10 hidden variables • Original kernel is polynomial kernel of order 4

  21. Real Data (1/3) • Thrombin binding problem • 1909 instances, 139,351 binary features • 0.68% of the entries are non-zero • 8-fold cross-validation

  22. Real Data (2/3) • Lymphoma classification (Alizadeh et al., 2000) • 96 samples, 4026 features • 10-fold cross-validation • Improved results compared with previous work (Weston, 2001)

  23. Real Data (3/3) • Protein family classification (Murzin et al., 1995) • Small positive set, large negative set • [Figure: results reported as receiver operating characteristic (ROC) scores versus the rate of false positives; 1 = best score, 0 = worst score]

  24. Conclusions • The problem of degraded SVM performance caused by almost orthogonal patterns was identified and analyzed • The common situation in which sparse vectors lead to large diagonals was identified and discussed • A Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases • Experimental results show improved accuracy on various artificial and real datasets once the large diagonals of the Gram matrices are suppressed

  25. Comments • Strong points: • The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation for suppressing them • The experiments are extensive • Weak points: • The Gram matrix transformation may be severely restricted in forecasting and other applications in which the test data are not known at training time • The proposed Gram matrix transformation was not tested directly in the experiments; instead, transformed kernel functions were used • Almost orthogonal patterns imply that multiple pattern vectors pointing in the same direction rarely exist, so the necessary condition for the statistical treatment of the pattern distribution is not satisfied

  26. End!
